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Preface 


This book presents in detail methodologies for the Bayesian estimation of single- 
regime and regime-switching GARCH models. These models are widespread 
and essential tools in financial econometrics and have, until recently, mainly 
been estimated using the classical Maximum Likelihood technique. As this study 
aims to demonstrate, the Bayesian approach offers an attractive alternative 
which enables small sample results, robust estimation, model discrimination 
and probabilistic statements on nonlinear functions of the model parameters. 

The author is indebted to numerous individuals for help in the preparation 
of this study. Primarily, I owe a great debt to Prof. Dr. Philippe J. Deschamps 
who inspired me to study Bayesian econometrics, suggested the subject, guided 
me under his supervision and encouraged my research. I would also like to thank 
Prof. Dr. Martin Wallmeier and my colleagues of the Department of Quantitative 
Economics, in particular Michael Beer, Roberto Cerratti and Gilles Kaltenrieder, 
for their useful comments and discussions. 

Iam very indebted to my friends Carlos Ordas Criado, Julien A. Straubhaar, 
Jérome Ph. A. Taillard and Mathieu Vuilleumier, for their support in the fields of 
economics, mathematics and statistics. Thanks also to my friend Kevin Barnes 
who helped with my English in this work. 

Finally, I am greatly indebted to my parents and grandparents for their 
support and encouragement while I was struggling with the writing of this thesis. 
Thanks also to Margaret for her support some years ago. Last but not least, 
thanks to you Sophie for your love which puts equilibrium in my life. 
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Summary 


This book presents in detail methodologies for the Bayesian estimation of single- 
regime and regime-switching GARCH models. Our sampling schemes have the 
advantage of being fully automatic and thus avoid the time-consuming and 
difficult task of tuning a sampling algorithm. The study proposes empirical ap- 
plications to real data sets and illustrates probabilistic statements on nonlinear 
functions of the model parameters made possible under the Bayesian framework. 

The first two chapters introduce the work and give a short overview of the 
Bayesian paradigm for inference. The next three chapters describe the estima- 
tion of the GARCH model with Normal innovations and the linear regression 
models with conditionally Normal and Student-t-GJR errors. For these mod- 
els, we compare the Bayesian and Maximum Likelihood approaches based on 
real financial data. In particular, we document that even for fairly large data 
sets, the parameter estimates and confidence intervals are different between the 
methods. Caution is therefore in order when applying asymptotic justifications 
for this class of models. The sixth chapter presents some financial applications of 
the Bayesian estimation of GARCH models. We show how agents facing different 
risk perspectives can select their optimal VaR point estimate and document that 
the differences between individuals can be substantial in terms of regulatory cap- 
ital. Finally, the last chapter proposes the estimation of the Markov-switching 
GJR model. An empirical application documents the in- and out-of-sample su- 
periority of the regime-switching specification compared to single-regime GJR 
models. We propose a methodology to depict the density of the one-day ahead 
VaR and document how specific forecasters’ risk perspectives can lead to differ- 
ent conclusions on the forecasting performance of the MS-GJR model. 


JEL Classification: C11, C13, C15, C16, C22, C51, C52, C53. 


Keywords and phrases: Bayesian, MCMC, GARCH, GJR, Markov-switching, 
Value at Risk, Expected Shortfall, Bayes factor, DIC. 
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Introduction 


(...) “skedasticity refers to the volatility or wiggle of a 
time series. Heteroskedastic means that the wiggle itself 
tends to wiggle. Conditional means the wiggle of the 
wiggle depends on its own past wiggle. Generalized means 
that the wiggle of the wiggle can depend on its own past 
wiggle in all kinds of wiggledy ways.” 


— Kent Osband 


Volatility plays a central role in empirical finance and financial risk management 
and lies at the heart of any model for pricing derivative securities. Research on 
changing volatility (i.e., conditional variance) using time series models has been 
active since the creation of the original ARCH (AutoRegressive Conditional 
Heteroscedasticity) model in 1982. From there, ARCH models grew rapidly into 
a rich family of empirical models for volatility forecasting during the last twenty 
years. They are now widespread and essential tools in financial econometrics. 
In the ARCH(gq) specification originally introduced by Engle [1982], the con- 
ditional variance at time t, denoted by hz, is postulated to be a linear function 


of the squares of past q observations {Yy¢_1, Yr—2,---,Ys—q}- More precisely: 
q 
hy = a+ >> ave; (1.1) 
i=1 
where the parameters ap > 0 and a; > 0 (¢ = 1,...,q) in order to ensure a pos- 


itive conditional variance. In many of the applications with the ARCH model, 
a long lag length and therefore a large number of parameters are called for. 
To circumvent this problem, Bollerslev [1986] proposed the Generalized ARCH, 
or GARCH(p, g), model which extends the specification of the conditional vari- 
ance (1.1) as follows: 


q P 
hy = a0 + y nye; + Dy Bjhi—j 
i=1 j=l 
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where ap > 0, a; > 0 (i = 1,...,q) and 8; > 0 (j = 1,...,p). In this case, 
the conditional variance depends on its past values which renders the model 
more parsimonious. Indeed, in most empirical applications it turns out that the 
simple specification p = q = 1 is able to reproduce the volatility dynamics of 
financial data. This has led the GARCH(1, 1) model to become the “workhorse 
model” by both academics and practitioners. 

Numerous extensions and refinements of the GARCH model have been pro- 
posed to mimic additional stylized facts observed in financial markets. These 
extensions recognize that there may be important nonlinearity, asymmetry, and 
long memory properties in the volatility process. Many of these models are 
surveyed in Bollerslev, Chou, and Kroner [1992], Bollerslev, Engle, and Nelson 
[1994], Engle [2004]. Among them, we may cite the popular Exponential GARCH 
model by Nelson [1991] as well as the GJR model by Glosten, Jaganathan, and 
Runkle [1993] which both account for the asymmetric relation between stock re- 
turns and changes in variance [see Black 1976]. An additional class of GARCH 
models, referred to as regime-switching GARCH, has gained particular attention 
in recent years. In these models, the scedastic function’s parameters can change 
over time according to a latent (i.e., unobservable) variable taking values in the 
discrete space {1,..., A}. The interesting feature of these models lies in the 
fact that they provide an explanation of the high persistence in volatility, 7.e., 
nearly unit root process for the conditional variance, observed with single-regime 
GARCH models [see, e.g., Lamoureux and Lastrapes 1990]. Furthermore, these 
models are apt to react quickly to changes in the volatility level which leads 
to significant improvements in volatility forecasts as shown by Dueker [1997], 
Klaassen [2002], Marcucci [2005]. Further details on regime-switching GARCH 
models can be found in Haas, Mittnik, and Paolella [2004], Hamilton and Susmel 
[1994]. 

The Maximum Likelihood (henceforth ML) estimation technique is the gen- 
erally favored scheme of inference for GARCH models, although semi- and non- 
parametric techniques have also been applied by some authors [see, e.g., Gallant 
and Tauchen 1989, Pagan and Schwert 1990]. The primary appeal of the ML 
technique stems from the well-known asymptotic optimality conditions of the 
resulting estimators under ideal conditions [see Bollerslev et al. 1994, Lee and 
Hansen 1994]. In addition, the ML procedure is straightforward to implement 
and is nowadays available in econometric packages. However, while conceptually 
simple, we may encounter practical difficulties when dealing with the ML esti- 
mation of GARCH models. First, the maximization of the likelihood function 
must be achieved via a constrained optimization technique. The model param- 
eters must indeed be positive to ensure a positive conditional variance and it 
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is also common to require that the covariance stationarity condition holds (this 
condition is )7/_, a; + 04_, 8; < 1 for the GARCH(p,q) model [see Boller- 
slev 1986, Thm.1, p.310]). The optimization procedure subject to inequality 
constraints can be cumbersome and does not necessarily converge if the true pa- 
rameter values are close to the boundary of the parameter space or if the process 
is nearly non-stationary. The maximization is even more difficult to achieve in 
the context of regime-switching GARCH models where the likelihood surface is 
multimodal. Depending on the numerical algorithm, ML estimates often prove 
to be sensitive with respect to starting values. Moreover, the covariance matrix 
at the optimum can be extremely tedious to obtain and ad-hoc approaches are 
often required to get reliable results (e.g., Hamilton and Susmel [1994] fix some 
transition probabilities to zero in order to determine the variance estimates for 
some model parameters). Second, as noted by Geweke [1988, p.77], in classical 
applications of GARCH models, the interest usually does not center directly on 
the model parameters but on possibly complicated nonlinear functions of the 
parameters. For instance, in the case of the GARCH(p, g) model, one might be 
interested in the unconditional variance, denoted by h,, which is given by: 


hy = ie 

ae he Lea Hi 2; 

provided that the covariance stationarity condition is satisfied. To assess the 
uncertainty of this quantity, classical inference involves tedious delta methods, 
simulation from the asymptotic Normal approximation of the parameter esti- 
mates or the bootstrap methodology. However, none of these techniques is com- 
pletely satisfactory. The delta method is an approximation which can be crude 
if the function of interest is highly nonlinear. The simulation and the bootstrap 
approaches can deal with nonlinear functions of the model parameters and give 
a full description of their distribution. Nevertheless, the former technique relies 
on asymptotic justifications and the latter method is very demanding since at 
each step of the procedure, a GARCH model is fitted to the bootstrapped data. 
Finally, in the case of regime-switching GARCH models, testing the null of 
versus K’ states is not possible within the classical framework. The regularity 
conditions for justifying the y? approximation of the likelihood ratio statistic 
do not hold as some parameters are undefined under the null hypothesis [see 
Frithwirth-Schnatter 2006, Sect.4.4]. 

Fortunately, difficulties disappear when Bayesian methods are used. First, 
any constraints on the model parameters can be incorporated in the model- 
ing through appropriate prior specifications. Moreover, the recent development 
of computational methods based on Markov chain Monte Carlo (henceforth 


4 1 Introduction 


MCMC) procedures can be used to explore the joint posterior distribution of the 
model parameters. These techniques avoid local maxima commonly encountered 
via ML estimation of regime-switching GARCH models. Second, exact distribu- 
tions of nonlinear functions of the model parameters can be obtained at low cost 
by simulating from the joint posterior distribution. In particular, we will show 
in Chap. 6 that, upon assuming that the underlying process is of GARCH type, 
the well known Value at Risk risk measure (henceforth VaR) can be expressed 
as a function of the model parameters. Therefore, the Bayesian approach gives 
an adequate framework to estimate the full density of the VaR. In conjunction 
with the decision theory framework, this allows to optimally choose a single 
point estimate within the density of the VaR, given our risk preferences. Hence, 
the Bayesian approach has a clear advantage in combining estimation and de- 
cision making. Lastly, in the Bayesian framework, the issue of determining the 
number of states can be addressed by means of model likelihood and Bayes fac- 
tors. All these reasons strongly motivate the use of the Bayesian approach when 
estimating GARCH models. 

The choice of the algorithm is the first issue when dealing with MCMC meth- 
ods and it depends on the nature of the problem under study. In the case of 
GARCH models, due to the recursive nature of the conditional variance, the 
joint posterior and the full conditional densities are of unknown forms, what- 
ever distributional assumptions are made on the model disturbances. Therefore, 
we cannot use the simple Gibbs sampler and need more elaborate estimation 
procedures. The initial approaches have been implemented using importance 
sampling [see Geweke 1988, 1989, Kleibergen and van Dijk 1993]. More re- 
cent studies include the Griddy-Gibbs sampler [see Ausin and Galeano 2007, 
Bauwens and Lubrano 1998] or the Metropolis-Hastings (henceforth M-H) algo- 
rithm with some specific choice of the proposal densities. The Normal random 
walk Metropolis is used in Miiller and Pole [1998], Vrontos, Dellaportas, and 
Politis [2000], Adaptive Radial-Based Direction Sampling (henceforth ARDS) 
is proposed by Bauwens, Bos, van Dijk, and van Oest [2004] while Nakatsuma 
[1998, 2000] constructs proposal densities from an auxiliary process. In the con- 
text of regime-switching ARCH models, Kaufmann and Frithwirth-Schnatter 
[2002], Kaufmann and Scheicher [2006] use the method of Nakatsuma [1998, 
2000] while Bauwens, Preminger, and Rombouts [2006], Bauwens and Rom- 
bouts [2007] rely on the Griddy-Gibbs sampler for regime-switching GARCH 
models. 

In the importance sampling approach, a suitable importance density is re- 
quired for efficiency which can be a bit of an art, especially if the posterior 
density is asymmetric or multimodal. In the random walk and independence M- 
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H strategies, preliminary runs and tuning are necessary. Therefore, the method 
cannot be completely automatic which is not a desirable property. The Griddy- 
Gibbs sampler of Ritter and Tanner [1992] is used by Bauwens and Lubrano 
[1998] in the context of GARCH models to get rid of these difficulties. This 
methodology consists in updating each parameter by inversion from the distri- 
bution computed by a deterministic integration rule. However, the procedure is 
time consuming and this can become a real burden for regime-switching mod- 
els which involve many parameters. Moreover, for computational efficiency, we 
must limit the range where the probability mass is computed so that the prior 
density has to be somewhat informative. In the case of the ARDS algorithm of 
Bauwens et al. [2004], the method involves a reparametrization in order to en- 
hance the efficiency of the estimation. This technique requires a large number of 
evaluations, which significantly slows down the estimation procedure compared 
to usual M-H approaches. Lastly, one could also use a Bayesian software such as 
BUGS [see Spiegelhalter, Thomas, Best, and Gilks 1995, Spiegelhalter, Thomas, 
Best, and Lunn 2007] for estimating GARCH models. However, this becomes 
extremely slow as the number of observations increases mainly due to the recur- 
sive nature of the conditional variance process. Moreover, the implementation 
of specific constraints on the model parameters is difficult and extensions to 
regime-switching specifications are limited. 

In the rest of the book, we will use the approach suggested by Nakatsuma 
(1998, 2000] which relies on the M-H algorithm where some model parameters 
are updated by blocks. The proposal densities are constructed from an auxiliary 
ARMA process for the squared observations. This methodology has the advan- 
tage of being fully automatic and thus avoids the time-consuming and difficult 
task, especially for non-experts, of choosing and tuning a sampling algorithm. 
We obtained very high acceptance rates with this M-H algorithm, ranging from 
89% to 95% for the single-regime GARCH(1, 1) model, which indicates that the 
proposal densities are close to the full posteriors. In addition, the approach of 
Nakatsuma [1998, 2000] is easy to extend to regime-switching GARCH models. 
In this case, the parameters in each regime can be regrouped and updated by 
blocks which may enhance the sampler’s efficiency. 


Organization of the book 


A short introduction to Bayesian inference and MCMC methods is given in 
Chap. 2. The rest of the book treats in detail the methodologies for the Bayesian 
estimation of single-regime and regime-switching GARCH models, proposes em- 
pirical applications to real data sets and illustrates some probabilistic state- 
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ments on nonlinear functions of the model parameters made possible under the 
Bayesian framework. 

In Chap. 3, we propose the Bayesian estimation of the parsimonious but 
effective GARCH(1,1) model with Normal innovations. We detail the MCMC 
scheme based on the methodology of Nakatsuma [1998, 2000]. An empirical 
application to a foreign exchange rate time series is presented where we compare 
the Bayesian and the ML estimates. In particular, we show that even for a fairly 
large data set, the point estimates and confidence intervals are different between 
the methods. Caution is therefore in order when applying the asymptotic Normal 
approximation for the model parameters in this case. We perform a sensitivity 
analysis to check the robustness of our results with respect to the choice of 
the priors and test the residuals for misspecification. Finally, we compare the 
theoretical and sample autocorrelograms of the process and test the covariance 
and strict stationarity conditions. 

In Chap. 4, we consider the linear regression model with conditionally 
heteroscedastic errors and exogenous or lagged dependent variables. We ex- 
tend the symmetric GARCH model to account for asymmetric responses to 
past shocks in the conditional variance process. To that aim, we consider the 
GJR(1,1) model of Glosten et al. [1993]. We fit the model to the Standard and 
Poors 100 (henceforth S&P 100) index log-returns and compare the Bayesian and 
the ML estimations. We perform a prior sensitivity analysis and test the resid- 
uals for misspecification. Finally, we test the covariance stationarity condition 
and illustrate the differences between the unconditional variance of the process 
obtained through the Bayesian approach and the delta method. In particular, 
we show that the Bayesian framework leads to a more precise estimate. 

In Chap. 5, we extend the linear regression model with conditionally hetero- 
scedastic errors by considering Student-t disturbances, which allows to model 
extreme shocks in a convenient manner. In the Bayesian approach, the heavy- 
tails effect is created by the introduction of latent variables in the variance 
process as proposed by Geweke [1993]. An empirical application based on the 
S&P100 index log-returns is proposed with a comparison between the estimated 
joint posterior and the asymptotic Normal approximation of the distribution of 
the estimates. We perform a prior sensitivity analysis and test the residuals for 
misspecification. Finally, we analyze the conditional and unconditional kurtosis 
of the underlying time series. 

In Chap. 6, we present some financial applications of the Bayesian estima- 
tion of GARCH models. We introduce the concept of Value at Risk risk measure 
and propose a methodology to estimate the density of this quantity for differ- 
ent risk levels and time horizons. This gives us the possibility to determine the 
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VaR. term structure and to characterize the uncertainty coming from the model 
parameters. Then, we review some basics in decision theory and use this frame- 
work as a rational justification for choosing a point estimate of the VaR. We 
show how agents facing different risk perspectives can select their optimal VaR 
point estimate and document, in an illustrative application, that the differences 
between individuals, in particular between fund managers and regulators, can 
be substantial in terms of regulatory capital. We show that the common testing 
methodology for assessing the performance of the VaR is unable to discrimi- 
nate between the point estimates but the deviations are large enough to imply 
substantial differences in terms of regulatory capital. This therefore gives an 
additional flexibility to the user when allocating risk capital. Finally, we extend 
our methodology to the Expected Shortfall risk measure. 

In Chap. 7, we extend the single-regime GJR model to the regime-switching 
GJR model (henceforth MS-GJR); more precisely, we consider an asymmetric 
version of the Markov-switching GARCH(1, 1) specification of Haas et al. [2004]. 
We introduce a novel MCMC scheme which can be viewed as an extension of 
the sampler proposed by Nakatsuma [1998, 2000]. Our approach allows to gen- 
erate the parameters of the MS-GJR model by blocks which may enhance the 
sampler’s efficiency. As an application, we fit a single-regime and a Markoy- 
switching GJR model to the Swiss Market Index log-returns. We use the random 
permutation sampler of Friihwirth-Schnatter [2001b] to find suitable identifica- 
tion constraints for the MS-GJR model and show the presence of two distinct 
volatility regimes in the time series. The generalized residuals are used to test 
the models for misspecification. By using the Deviance information criterion 
of Spiegelhalter, Best, Carlin, and van der Linde [2002] and by estimating the 
model likelihoods using the bridge sampling technique of Meng and Wong [1996], 
we show the in-sample superiority of the MS-GJR model. To test the predictive 
performance of the models, we run a forecasting analysis based on the VaR. 
In particular, we compare the MS-GJR model to a single-regime GJR model 
estimated on rolling windows and show that both models perform equally well. 
However, contrary to the single-regime model, the Markov-switching model is 
able to anticipate structural breaks in the conditional variance process and needs 
to be estimated only once. Then, we propose a methodology to depict the density 
of the one-day ahead VaR by simulation and document how specific forecasters’ 
risk perspectives can lead to different conclusions on forecasting performance 
of the model. A comparison with the traditional ML approach concludes the 
chapter. 

Finally, we summarize the main results of the book and discuss future av- 
enues of research in Chap. 8. 
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Bayesian Statistics and MCMC Methods 


“The people who don’t know they are Bayesian are called 
non-Bayesian.” 


— Irving J. Good 


This chapter gives a short introduction to the Bayesian paradigm for inference 
and an overview of the Markov chain Monte Carlo (henceforth MCMC) algo- 
rithms used in the rest of the book. For a more thorough discussion on Bayesian 
statistics, the reader is referred to Koop [2003], for instance. Further details on 
MCMC methods can be found in Chib and Greenberg [1996], Smith and Roberts 
[1993], Tierney [1994]. The reader who is familiar with these topics can skip this 
part of the book and go to the first chapter dedicated to the Bayesian estimation 
of GARCH models, on page 17. 

The plan of this chapter is as follows. The Bayesian paradigm is introduced 
in Sect. 2.1. MCMC techniques are presented in Sect. 2.2 where we introduce 
the Gibbs sampler as well as the Metropolis-Hastings algorithm. We also briefly 
discuss some practical implementation issues. 


2.1 Bayesian inference 


As in the classical approach to inference, the Bayesian estimation assumes a 
T x 1 vector y = (yi---yr)' of observations described through a probability 
density p(y | 0). The parameter 0 € © serves as an index of the family of 
possible distributions for the observations. It represents the characteristics of 
interest one would wish to know in order to obtain a complete description of the 
generating process for y. It can be a scalar, a vector, a matrix or even a set of 
these mathematical objects. For simplicity, we will consider 6 as a d-dimensional 
vector, hence 6 € © C R? in what follows. 

The difference between the Bayesian and the classical approach lies in the 
mathematical nature of @. In the classical framework, it is assumed that there 


exists a true and fixed value for parameter 6. Conversely, the Bayesian approach 


10 2 Bayesian Statistics and MCMC Methods 


considers @ as a random variable which is characterized by a prior density de- 
noted by p(@). The prior is specified with the help of parameters called hyper- 
parameters which are initially assumed to be known and constant. Moreover, 
depending on the researcher’s prior information, this density can be more or less 
informative. Then, by coupling the likelihood function of the model parameters, 
L(O | y) = ply | 0), with the prior density, we can invert the probability density 
using Bayes’ rule to get the posterior density p(@ | y) as follows: 


_ £@Ly)P) 
POLY) = FET y)p@)dd 


(2.1) 


This posterior is a quantitative, probabilistic description of the knowledge about 
the parameter 6 after observing the data. 

It is often convenient to choose a prior density which is conjugate to the like- 
lihood. That is, a density that leads to a posterior which belongs to the same 
distributional family as the prior. In effect, conjugate priors permit posterior 
densities to emerge without numerical integration. However, the easy calcula- 
tions of this specification comes with a price due to the restrictions they impose 
on the form of the prior. In many cases, it is unlikely that the conjugate prior 
is an adequate representation of the prior state of knowledge. In such cases, the 
evaluation of (2.1) is analytically intractable, so asymptotic approximations or 
Monte Carlo methods are required. Deterministic techniques can provide good 
results for low dimensional models. However, when the dimension of the model 
becomes large, simulation is the only way to approximate the posterior density. 


2.2 MCMC methods 


The idea of MCMC sampling was first introduced by Metropolis, Rosenbluth, 
Rosenbluth, Teller, and Teller [1953] and was subsequently generalized by Hast- 
ings [1970]. For ease of exposition, we will restrict the presentation to the context 
of Bayesian inference. A general and detailed statistical theory of MCMC meth- 
ods can be found in Tierney [1994]. 

The MCMC sampling strategy relies on the construction of a Markov chain 
with realizations 9!!, @l4J,...,6l],... in the parameter space ©. Under appropri- 
ate regularity conditions [see Tierney 1994], asymptotic results guarantee that 
as j tends to infinity, then 6”! tends in distribution to a random variable whose 
density is p(6@ | y). Hence, the realized values of the chain can be used to make 
inference about the joint posterior. All we require are algorithms for construct- 
ing appropriately behaved chains. The best known MCMC algorithms are the 
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Gibbs sampler and the Metropolis-Hastings (henceforth M-H) algorithm. These 
samplers are nowadays essential tools to perform realistic Bayesian inference. 


2.2.1 The Gibbs sampler 


The Gibbs sampler is possibly the MCMC sampling technique which is used 
most frequently. In the statistical physics literature, it is known as the heat bath 
algorithm. Geman and Geman [1984] christened it in the mainstream statisti- 
cal literature as the Gibbs sampler. An elementary exposition can be found in 
Casella and George [1992]. See also Gelfand and Smith [1990], Tanner and Wong 
[1987] for practical examples. 

The Gibbs sampler is an algorithm based on successive generations from 
the full conditional densities p(6; | 04:,y), t.e., the posterior density of the ith 
element of 0 = (6; --- 6a)’, given all other elements, where elements of 6 can be 
scalars or sub-vectors. In practice the sampler works as follows: 


1. Initialize the iteration counter of the chain to 7 = 1 and 
set an initial value 0! = (6!0!...9!ly, 

2. Generate a new value 6! from 6-1] through successive 
generation values: 


a!) ~ p(s | 04," y) 


A!) ~ pO. | 0%), 8-4... ,@8—" y) 


a) ~ p(6a| Ou!,,y); 


3. Change counter j7 to 7 +1 and go back to step 2 until 
convergence is reached. 


As the number of iterations increases, the chain approaches its stationary dis- 
tribution and convergence is then assumed to hold approximately [see Tierney 
1994]. Sufficient conditions for the convergence of the Gibbs sampler are given 
in Roberts and Smith [1994, Sect.4]. As noted in Chib and Greenberg [1996, 
p.414], these conditions ensure that each full conditional density is well defined 
and that the support of the joint posterior is not separated into disjoint regions 
since this would prevent exploration of the full parameter space. Although these 
are only sufficient conditions for the convergence of the Gibbs sampler, they are 
extremely weak and are satisfied in most applications. 

The Gibbs sampler is the most natural choice of MCMC sampling strategy 
when it is easy to write down full conditionals from which we can easily generate 
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draws. When the expression of p(4; | 04:,y) is nonstandard, we might consider 
rejection methods [see, e.g., Ripley 1987], the Griddy-Gibbs sampler when 6; is 
univariate [see Ritter and Tanner 1992], adaptive rejection sampling [see Gilks 
and Wild 1992] or M-H sampling as shown in the next section. 


2.2.2 The Metropolis-Hastings algorithm 


Some complicated Bayesian problems cannot be solved by using the Gibbs sam- 
pler. This is the case when it is not easy to break down the joint density into 
full conditionals or when the full conditional densities are of unknown form. 
The M-H algorithm is a simulation scheme which allows to generate draws from 
any density of interest whose normalizing constant is unknown. The algorithm 
consists of the following steps. 


1. Initialize the iteration counter to 7 = 1 and set an initial 
value 61°, 

2. Move the chain to a new value 6* generated from a pro- 
posal (candidate) density q(e | 6%—1)); 

3. Evaluate the acceptance probability of the move from 
6-1] to @* given by: 


min £ 2% Ly) (084 | 6") 
p(OV-4 | y) q(@* | AUN)?" J” 


If the move is accepted, set 64] = 6*, if not, set 94] = gU-H 
so that the chain does not move; 

4. Change counter from 7 to 7+1 and go back to step 2 until 
convergence is reached. 


As in the Gibbs sampler, the chain approaches its equilibrium distribution as 
the number of iterations increases [see Tierney 1994]. The power of the M-H 
algorithm stems from the fact that the convergence of the chain is obtained 
for any proposal g whose support includes the support of the joint posterior 
[see Roberts and Smith 1994, Sect.5]. It is however crucial that g approximates 
closely the posterior to guarantee an acceptance rate which is reasonable. 

With no intention of being exhaustive, some comments are in order here. 
If we choose a symmetric proposal density, i.e., q(0" | 6*) = q(@* | 0U!), the 
acceptance probability of the M-H algorithm reduces to: 


nin | amr yy"! 
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so that the proposal does not need to be evaluated. This simpler version of 
the M-H algorithm is known as the Metropolis algorithm because it is the orig- 
inal algorithm by Metropolis et al. [1953]. A special case consists of a pro- 
posal density which only depends on the distance between 6* and 6~1], i.e., 
q(o* | 0¥-) = q(@* — 04-1]). The resulting algorithm is referred to as the 
random walk Metropolis algorithm. For instance, g could be a multivariate Nor- 
mal density centered at previous draw 99-1] and whose covariance matrix is 
calibrated to take steps which are reasonably close to 64—"] such that the prob- 
ability of accepting the candidate is not too low, but with a step size large 
enough to ensure a sufficient exploration of the parameter space. The drawback 
of this method is that it is not fully automatic since the covariance matrix needs 
to be chosen carefully; thus preliminary runs are required. 

Another special case of the M-H sampler is the independence M-H algorithm, 
in which proposal draws are generated independently of the current position of 
the chain, i.e., q(0* | 99-4) = q(6*). This algorithm is often used with a Normal 
or a Student-t proposal density whose moments are estimated from previous runs 
of the MCMC sampler. This approach works well for well-behaved unimodal 
posterior densities but may be very inefficient if the posterior is asymmetric or 
multimodal. 

Finally, we note that in the form of the M-H algorithm we have presented, 
the vector @ is updated in a single block at each iteration so that all elements 
are changed simultaneously. However, we could also consider componentwise 
algorithms where each component is generated by its own proposal density [see 
Chib and Greenberg 1995, Tierney 1994]. In fact, the Gibbs belongs to this class 
of samplers where each component is updated sequentially, and where proposal 
densities are the full conditionals. In this case, new draws are always accepted 
[see Chib and Greenberg 1995]. The M-H algorithm is often used in conjunction 
with the Gibbs sampler for those components of 6 that have a conditional density 
that cannot be sampled from directly, typically because the density is known 
only up to a scale factor [see Tierney 1994]. 


2.2.3 Dealing with the MCMC output 


Having examined the building-blocks for the standard MCMC samplers, we now 
discuss some issues associated with their practical implementation. In particular, 
we comment on the manner we can assess their convergence, the way we can 
account for autocorrelation in the chains and how we can obtain characteristics 
of the joint posterior from the MCMC output. Further details can be found in 
Kass, Carlin, Gelman, and Neal [1998], Smith and Roberts [1993]. 
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Several statistics have been devised for assessing convergence of MCMC out- 
puts. The basic idea behind most of them is to compare moments of the sampled 
parameters at different parts of the chain. Alternatively, we can compare several 
sequences drawn from different starting points and check that they are indistin- 
guishable as the number of iterations increases. We refer the reader to Cowles 
and Carlin [1996], Gelman [1995] for a comparative review of these techniques. 
In the rest of the book, we will use a methodology based on the analysis of 
variance developed by Gelman and Rubin [1992]. More precisely, the approxi- 
mate convergence is diagnosed when the variance between different sequences 
is no larger than the variance within each individual sequence. Apart from for- 
mal diagnostic tests, it is also often convenient to check convergence by plotting 
the parameters’ draws over iterations (trace plots) as well as the cumulative or 
running mean of the drawings. 

Regarding the Monte Carlo (simulation) error, it is crucial to understand 
that the draws generated by a MCMC method are not independent. The auto- 
correlation either comes from the fact that the new draw depends on the past 
value of the chain or that the old element is duplicated. When assessing the pre- 
cision of an estimator, we must therefore rely on estimation techniques which 
account for this autocorrelation [see, e.g., Geweke 1992, Newey and West 1987]. 
In the rest of the book, we will estimate the numerical standard errors, that is 
the variation of the estimates that can be expected if the simulations were to be 
repeated, by the method of Andrews [1991], using a Parzen kernel and AR(1) 
pre-whitening as presented in Andrews and Monahan [1992]. As noted by De- 
schamps [2006], this ensures easy, optimal, and automatic bandwidth selection. 

After the run of a Markov chain and its convergence to the stationary distri- 
bution, a sample {oll} 7_, from the joint posterior density p(@ | y) is available. 
We can thus approximate the posterior expectation of any function €(8) of the 
model parameters: 


‘aly [€(8)] = ie €(0)p(6 | yao (2.2) 


by averaging over the draws from the posterior distribution in the following 


manner: 


é= €(oli!) 


1 

J = 
Under mild conditions, the sample average € converges to the posterior expec- 
tation by the law of large numbers, even if the draws are generated by a MCMC 
sampler [see Tierney 1994]. Some particular cases of (2.2) allow to obtain char- 


acteristics of the joint posterior. For instance, when €(@) = @ we obtain the 
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posterior mean vector 6; for (0) = (0 — 0)(0 — 0)’ we obtain the posterior co- 
variance matrix; for €(0) = Iygec}, where I,} denotes the indicator function 
which is equal to one if the constraint holds and zero otherwise, we obtain the 
posterior probability of a set C. Finally, if we are interested in the marginal 
posterior density of a single component of 6, we can estimate it through a his- 
togram or a kernel density estimate of sampled values [see Silverman 1986]. By 
contrast, deterministic numerical integration is often intractable. 


3 


Bayesian Estimation of the GARCH(1, 1) Model 


with Normal Innovations 


“Large changes tend to be followed by large changes (of 
either sign) and small changes tend to be followed by 
small changes.” 


— Benoit Mandelbrot 


(...) “it is remarkable how large a sample is required for 
the Normal distribution to be an accurate approximation.” 


— Robert McCulloch and Peter E. Rossi 


In this chapter, we propose the Bayesian estimation of the parsimonious but ef- 
fective GARCH(1, 1) model with Normal innovations. We sample the joint poste- 
rior distribution of the parameters using the approach suggested by Nakatsuma 
(1998, 2000]. As a first step, we fit the model to foreign exchange log-returns 
and compare the Bayesian and the Maximum Likelihood estimates. Next, we 
analyze the sensitivity of our results with respect to the choice of the priors 
and test the residuals for misspecification. Finally, we illustrate some appealing 
aspects of the Bayesian approach through probabilistic statements made on the 
parameters. 

The plan of this chapter is as follows. We set up the model in Sect. 3.1. 
The MCMC scheme is detailed in Sect. 3.2. The empirical results are presented 
in Sect. 3.3. We conclude with some illustrative applications of the Bayesian 
approach in Sect. 3.4. 


3.1 The model and the priors 
A GARCH(1,1) model with Normal innovations may be written as follows: 
ye = ehi/? for t= Tyaccag cl’ 


e, “ N(0,1) (3.1) 
ht = a0 + oye + Bhe-1 
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where ag > 0, ay > 0 and @ = 0 to ensure a positive conditional variance 
and ho = yo = 0 for convenience; N’(0,1) is the standard Normal density. In 
this setting, the conditional variance h; is a linear function of the squared past 
observation and the past variance. 

In order to write the likelihood function, we define the vectors y = (yi --- yr)! 
and @ = (ap a1)’ and we regroup the model parameters into w = (a, 3) for 
notational purposes. In addition, we define the T x T diagonal matrix: 


x = X(p) = diag({he(d) }21) 


where: 


hi(a) = ao + aryz_y + Bhe-i(#) . 


From there, the likelihood function of y can be expressed as follows: 
LW | y) x (det ©)~"/? exp [-4y/E-ty] . 


We propose the following proper priors on the parameters a and ( of the 
preceding model: 


p(@) ee N2(a | Hg La) pa>0} 
p(B) x N(B | ue, Da )Igasoy 


where py. and &, are the hyperparameters, I;,} is the indicator function which 
equals unity if the constraint holds and zero otherwise, 0 is a 2 x 1 vector of 
zeros and Ny is the d-dimensional Normal density (d > 1). In addition, we 
assume prior independence between parameters @ and ( which implies that 
p(w) = p(a)p(Z). Then, we construct the joint posterior density via Bayes’ rule: 


Pb ly) x LY | y)p(y) . (3.2) 


3.2 Simulating the joint posterior 


The recursive nature of the variance equation in model (3.1) does not allow 
for conjugacy between the likelihood function and the prior density in (3.2). 
Therefore, we rely on the M-H algorithm to draw samples from the joint posterior 
distribution. The algorithm in this section is a special case of the algorithm 
described by Nakatsuma [1998, 2000]. We draw an initial value Wl = (all, gl!) 
from the joint prior and we generate iteratively J passes for yw. A single pass is 
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decomposed as follows: 


a) pla:|Be-4,y) 
BY) ~ p(B | ally) . 


Since no full conditional density is known analytically, we sample parameters 
a and (? from two proposal densities. These densities are obtained by noting 
that the GARCH(1, 1) model can be written as an ARMA(1,1) model for {y?}. 
Indeed, by defining w; = y? — ht, we can transform the expression of the condi- 
tional variance as follows: 


he = a9 + a1y7_1 + Bht-1 
> yf = a0 + (a1 + B)ye_y — Buri t+ wr (3.3) 


where w; can be written as: 


t 


2 
ue = 92h = (HE - ) m= (81) hu 


with y? denoting a Chi-squared variable with one degree of freedom. Hence, by 
construction, {w,} is a Martingale Difference process with a conditional mean 
of zero and a conditional variance of 2h? since a x7 variable has a unit mean 
and a variance equal to two. 

Following Nakatsuma [1998, 2000], we construct an approximate likelihood 
for parameters @ and (@ from expression (3.3). The procedure consists in ap- 
proximating first the variable w; by a variable z, which is Normally distributed 
with a mean of zero and a variance of 2h?. This leads to the following auziliary 
model: 

Yt = a0 + (a1 + B)ye_y — Ba-1 + 2 - 


Then, by noting that z, is a function of ~ given by: 


zi(w) = yf — a0 — (a1 + B)y?_1 + Beri) (3.4) 


and by defining the T x 1 vector z = (z--- zr)’ as well as the T x T diagonal 


matrix: 
A = A(}) = diag({2h?()}7-4) 


we can approximate the likelihood function of w from the auxiliary model as 
follows: 


Lb | y) « (det A)~"/? exp [—4z/A71z] . (3.5) 
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As will be shown hereafter, the construction of the proposal densities for pa- 
rameters @ and ( is based on this approximate likelihood function. 


3.2.1 Generating vector a 


Recursive transformations initially proposed by Chib and Greenberg [1994] allow 
to express the function z,(y) in (3.4) as a linear function of the 2 x 1 vector a. 
Let us define v; = y? for notational convenience. The recursive transformations 
are defined as follows: 


W=14+6l,, 

Up = Ue-1+ Bue, 
where 1} = vj = 0. As shown in Prop. A.1 (see App. A), upon defining the 
2x1 vector c, = (ly uy)’, the function z; can be expressed as z; = v;—c,a@. Then, 
by considering the T x 1 vector v = (v1 --+-vr)/ and the T x 2 matrix C’ whose 


tth row is c}, we get z = v — Ca. Therefore, we can express the approximate 
likelihood function of parameter a@ as follows: 


L(a@ | B,y) « (det A)~/? exp [—3(v — Ca)'A7}(v — Cay] . 


The proposal density to sample vector @ is obtained by combining this likelihood 
function and the prior density by the usual Bayes update: 


a (a | a, By) x No2(a | Has Sa) paca) 
with: 


pages Ol aaa OF ah aay 
fig = Be(C'A“v + Da") 
where the Tx T diagonal matrix A = diag ({2h?(@, 3)}#_,) and @ is the previous 


draw of @ in the M-H sampler. A candidate a* is sampled from this proposal 
density and accepted with probability: 


. f p(a*,8 | y) do(@ | a*, B,y) \ 
min { Bly) da(a* a, By)" S 


3.2.2 Generating parameter 3 


The function z,(y) in (3.4) could be expressed, in the previous section, as a 
linear function of parameter @ but cannot be expressed as a linear function of 


3.2 Simulating the joint posterior 21 


3. To overcome this problem, we linearize z;(() by a first order Taylor expansion 
at point (3: 


where B is the previous draw of parameter (@ in the M-H sampler. Furthermore, 
let us define the following: 


re=2(@)+6Vi , Vez — 


where the terms V; can be computed by the following recursion: 
Vi = yp — 4-1(8) + BVi-n 


with Vo = 0. This recursion is simply obtained by differentiating (3.4) with 
respect to 3. Then, we regroup these terms into the T x 1 vectors r = (r1--- rr)! 
and V = (V,--:Vr)! and we approximate the term within the exponential 
in (3.5) by z ~ r—PV. This yields the following approximate likelihood function 
for parameter (3: 


L(3 | a,y) x (det A)~!/? exp [-3(r — BV)'A“*(r — BV)] . 


The proposal density to sample ( is obtained by combining this likelihood and 
the prior density by Bayes’ update: 


qa(B | a, B,y) « N(B | fig, Ua)Ice>0} 
with: 
i, VA Vy 2" 
jig = Lp(V'A-*r + Dg 'p,) 


where the T x T diagonal matrix A = diag({2h?(a, py\7 4). A candidate 3% is 
sampled from this proposal density and accepted with probability: 


in | lea ly) (Bl o8 sy) | 
P(@, 8B] y) 9e(8* | a, 8,y) 
We end this section with some comments regarding the implementation of the 


MCMC scheme. The program is written in the R language [see R, Development 
Core Team 2007] with some subroutines implemented in C in order to speed up 
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the simulation procedure. The validity of the algorithm as well as the correctness 
of the computer code are verified by a variant of the method proposed by Geweke 
[2004]. We sample w from a proper joint prior and generate some passes of the 
M-H algorithm. At each pass, we simulate the dependent variable y from the 
full conditional p(y | w) which is given by the conditional likelihood. This way, 
we draw a sample from the joint density p(y, w). If the algorithm is correct, the 
resulting replications of w should reproduce the prior. The Kolmogorov-Smirnov 
empirical distribution test does not reject this hypothesis at the 1% significance 
level. 


3.3 Empirical analysis 


We apply our Bayesian estimation method to daily observations of the Deutsch- 
mark vs British Pound (henceforth DEM/GBP) foreign exchange log-returns. 
The sample period is from January 3, 1985, to December 31, 1991, for a total of 
1’974 observations. The nominal returns are expressed in percent as in Bollerslev 
and Ghysels [1996]. This data set has been proposed as an informal benchmark 
for GARCH time series software validation and is available from the Journal of 
Business and Economic Statistics at ftp://www.amstat.org/. From this time 
series, the first 750 observations, which is about three financial years, are used 
to illustrate the Bayesian approach. The data set is large enough to perform 
classical Maximum Likelihood (henceforth ML) estimation and apply asymp- 
totic justifications. Hence, we have an interesting point of view from which to 
compare classical and Bayesian approaches. The remaining data set will be used 
in an empirical analysis proposed in Chap. 6. 

The observation window excerpt from our data set is plotted in the upper 
part of Fig. 3.1. We test for autocorrelation in the time series by testing the 
joint nullity of autoregressive coefficients for {y;}. We estimate the regression 
with autoregressive coefficients up to lag 20 and compute the covariance matrix 
using the White estimate. The p-values of the Wald test is 0.377 which does 
not support the presence of autocorrelation. However, from Fig. 3.1, we clearly 
observe clusters of high and low variability in the time series. This phenomenon 
is well known in financial data and is referred to as volatility clustering. This 
effect is emphasized in the lower part of the figure where the sample autocorrelo- 
gram of squared observations is displayed. In this case, the first autocorrelations 
are large and significant, indicating GARCH effects; the Wald test strongly re- 
jects the null hypothesis of the absence of autocorrelation in the squares. As an 
additional data analysis, we test for unit root using the test by Phillips and Per- 
ron [1988]. The test strongly rejects the I(1) hypothesis. From this preliminary 
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Fig. 3.1. DEM/GBP foreign exchange daily log-returns (upper graph) and sample 
autocorrelogram of the squared log-returns (lower graph). 


analysis, we conclude that the time series is not integrated and does not exhibit 
autocorrelation. However, we strongly suspect the presence of GARCH effects 
in the data. 
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3.3.1 Model estimation 


We fit the parsimonious GARCH(1,1) model to the data for this observation 
window. As prior densities for the Bayesian estimation, we choose truncated 
Normal densities with zero mean vectors and diagonal covariance matrices. The 
variances are set to 10’000 so we do not introduce tight prior information in our 
estimation (see Sect. 3.3.2 for a formal check). Finally, we recall that the joint 
prior is constructed by assuming prior independence between @ and 7. We run 
two chains for 10’000 passes each. We emphasize the fact that only positivity 
constraints are implemented in the M-H algorithm, through the prior densities; 
no stationarity conditions are imposed in the simulation procedure. In addition, 
we estimate the model by the usual ML technique for comparison purposes. 

In Fig. 3.2, the running means are plotted over iterations. For all param- 
eters, we notice a convergence of the two chains toward a constant value after 
something like 5’000 iterations. As a formal check, we follow Gelman and Rubin 
[1992] where the authors elaborated the idea that the chain trajectories should 
be the same after convergence using analysis of variance techniques. Considering 


m parallel chains and a real function € = €(w) of the model parameters, there 


are m trajectories of length J given by fell — 4 =1,...,m. The variances 
between chains and within chains, respectively, denoted by B and W, are then 


defined as follows: 


i ae = 
Beat os é) 
we syrell_zy 
m(J —1) i=1 j=1 ; 


where €; is the average of observations of the ith chain and € is the average of 
these averages. After convergence, all these mJ values for €; are drawn from the 
posterior distribution, and 0%, the variance of €, can be consistently estimated 
by W, B as well as the following weighted average: 


1 1 
A2Q + 


If the chains have not yet converged, then initial values will still be influencing 
the trajectories and, due to their overdispersion, will force oz to overestimate 
oz until stationarity is reached. On the other hand, before convergence, W will 
tend to underestimate oz because each chain will not have adequately traversed 
the complete state space. Following this reasoning, Gelman and Rubin [1992] 
construct an indicator of convergence; this is the estimator of potential scale 
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reduction factor given by: 


As the simulation converges, the potential scale reduction declines to one, mean- 
ing that the m parallel chains are essentially overlapping. Gelman and Rubin 
[1992] suggests accepting convergence when the value of R is below 1.2. Since 
this indicator is subject to estimation error, asymptotic confidence bands can be 
constructed and the 97.5th percentile is used as a conservative point estimate. 

In our context, we test the convergence of the chains by using the following 
functions: 


Ey) =a0 , &(b)=a and €() =~. 


For these three functions, the diagnostic test by Gelman and Rubin [1992] does 
not lead to the rejection of the convergence if we consider the second half of the 
simulated values; the 97.5th percentile values for R indeed belong to the interval 
(1.04, 1.05]. We can therefore be confident that the generated parameters are 
drawn from the joint posterior distribution. 

Complementary analyses of the MCMC output are also worth mentioning at 
this point. In particular, we note that the one-lag autocorrelations in the chains 
range from 0.75 for parameter a, to 0.95 for @ which is reasonable. Moreover, 
the sampling algorithm allows to reach very high acceptance rates ranging from 
89% for vector a to 95% for 3, suggesting that the proposal densities are close 
to the full conditionals. On the basis of these results, we discard the first 5’000 
draws from the overall MCMC output as a burn-in period and merge the two 
chains to get a final sample of length 10’000. 

The posterior statistics as well as the ML results are reported in Table 3.1. 
First, we note that even though the number of observations is large, the ML es- 
timates and the Bayesian posterior means are different; the ML point estimate 
is lower for components of vector a and higher for parameter 3. We also notice 
a difference between the 95% confidence intervals. Whereas the confidence band 
is symmetric in the ML case due to the asymptotic Normality assumption, this 
is not true for the posterior confidence intervals. The reason can be explained 
through Fig. 3.3 where the marginal posterior densities of the parameters are 
displayed. We clearly notice the asymmetric shape of the histograms for parame- 
ters ag and a1; the skewness values are 0.46 and 0.39, both significantly different 
from zero at the 1% significance level. Therefore the ML confidence band has 
a tendency to underestimate the right boundary of the 95% confidence interval 
for these parameters. In the case of parameter (, the skewness is —0.09, also 
significant; in this case, the Maximum Likelihood approach overestimates the 
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Table 3.1. Estimation results for the GARCH(1,1) model with Normal 


innovations.* 
yp YMLE wo wos Wo.02 Woo75 min max IF 
Qo 0.039 0.048 0.047 0.022 0.080 0.011 0.119 9.79 
(0.014,0.064] (0.448) 
ont 0.198 0.226 0.223 0.128 0.337 0.083 0.499 5.85 
[0.102,0.294] (1.284) 
B 0.686 0.636 0.636 0.476 0.795 0.338 0.849 40.79 


(0.538,0.833] (5.021) 


* we: Maximum Likelihood estimate; a: posterior mean; we: estimated pos- 
terior quantile at probability ¢; min: minimum value; max: maximum value; IF: 
inefficiency factor (i.e., ratio of the squared numerical standard error and the 
variance of the sample mean from a hypothetical izd sampler); [e]: Maximum 
Likelihood 95% confidence interval; (e): numerical standard error (x 10°). The 
posterior statistics are based on 10’000 draws from the joint posterior sample. 


left boundary of the 95% confidence band. Moreover, as shown in the bottom 
right-hand side of the figure, the joint density of parameters ao and (3 is slightly 
different from the ellipsoid obtained with the asymptotic Normal approximation. 
Therefore, these results warn us against the abusive use of asymptotic justifi- 
cations. In the present case, even 750 observations do not suffice to justify the 
asymptotic Normal approximation for the parameters estimates. 

The last column of Table 3.1 reports the inefficiency factors (IF) for the 
different parameters. Their values are computed as the ratio of the squared nu- 
merical standard error of the posterior sample and the variance estimate divided 
by the number of iterations (7.e., the variance of the sample mean from a hy- 
pothetical iid sequence). The numerical standard errors are estimated by the 
method of Andrews [1991], using a Parzen kernel and AR(1) pre-whitening as 
presented in Andrews and Monahan [1992]. As noted by Deschamps [2006], this 
ensures easy, optimal, and automatic bandwidth selection. In our estimation, 
using 10’000 simulations out of the posterior distribution seems appropriate if 
we require that the Monte Carlo error in estimating the mean is smaller than 
0.4% of the variation of the error due to the data. The larger inefficiency factor 
reported for parameter (3 is reflected in a larger autocorrelation in the simulated 
values. 
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Parameter op 
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o- 
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Fig. 3.3. Marginal posterior densities of the GARCH(1,1) parameters; upper graph: 
parameter ao; lower graph: parameter a;. The histograms are based on 10’000 draws 
from the joint posterior sample. 
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Parameter B 
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Fig. 3.3. (cont.) Marginal posterior densities of the GARCH(1, 1) parameters; upper 
graph: parameter 3; lower graph: scatter plot of (ao,(). Both graphs are based on 
10’°000 draws from the joint posterior sample. 
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3.3.2 Sensitivity analysis 


The Bayesian approach is often criticized on the grounds that the choice of the 
prior density may have a non negligible impact on the posterior density and, 
consequently, bias the posterior results. It is therefore important to determine 
the extent of this impact through a sensitivity analysis. To that aim, we follow 
Geweke [1999] who proposes a methodology to estimate the Bayes factors for 
the initial model against a model with an alternative prior. While the Bayes 
factor is a quantity which is often difficult to estimate, Geweke [1999, Sect.2] 
shows that it is possible to approximate the Bayes factor between two models 
differing only by their prior densities using the posterior simulation output from 
just one of the models. This approach provides an attractive way of performing 
sensitivity analysis since it does not require the estimation of the alternative 
model. 

More precisely, let us denote by pr(y) the initial prior density for w = (a, (3) 
and by pa(w) the alternative prior used to test the sensitivity of the posterior 
density. Based on the Tx 1 vector of observations y = (y1--- yr)’, the Bayes fac- 
tor in favor of the alternative model A over the initial model I can be expressed 


A 
BF a1 = LOI) 


p(y | I) 


where the marginal densities are found by integrating out the parameters: 


as follows: 


ply |) = f £0 |y)pe(w)ae 
Developing the Bayes factor using the expression of the marginal densities yields: 


pr, , - L£@ yeaa 
AT SL Ly pra 
Clb | y)on(y) Bap 
SLO | y)rr(h)d 
= fe | Lip | y)pr() Jaw 
piv) LLL | y)pr()dw 
= | BO ow ly. daw 
=E pate) 
VD [ i(w) 


where the notation E,)y 7) emphasizes the fact that the posterior expectation 


is calculated with respect to the initial prior py. In this simple context, we 
thus notice that the Bayes factor is nothing else than the posterior expectation 
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under the initial prior of the ratio of prior densities. The posterior expectation 
can therefore be estimated using the joint posterior sample {pbl}7_, as follows: 


[4]) 
BFay1 = Eyiy,1 E a 23 ms ea : (3.6) 


We test the sensitivity of our posterior results by considering three alterna- 
tive priors which are truncated Normal densities as the initial prior. We choose 
however different hyperparameters, in particular larger variances in the covari- 
ance matrices. Formally, the alternative priors may be expressed as follows: 


p(a) « No(@ | 12,07 12)Ifas0} 
p(B) « N(B | 2,07) caso} 


where t2 is a 2 x 1 vector of ones, Iz is a 2 x 2 identity matrix, yz is the prior 
mean and o? the prior variance; their values are given in the first two columns 
of Table 3.2. The Bayes factors are estimated using approximation (3.6) based 
on 10’000 draws from the joint posterior sample. The discrimination between 
models is then based on the Jeffrey’s scale of evidence [see Kass and Raftery 
1995, Sect.3.2] which can be summarized as follows: 


e Strong evidence in favor of the initial prior compared to the alter- 
native prior: 
BFasr < 0.1 


e Moderate evidence in favor of the initial prior compared to the 


alternative prior: 
0.1 < BFayt < 0.3125 


e Weak evidence in favor of the initial prior compared to the alter- 
native prior: 
0.3125 < BFasy <1. 


Estimated BF are reported in the last column of Table 3.2. The numerical 
standard errors are not shown since their values are negligible. First, we note 
that a change in the prior mean has no impact on the BF. On the contrary, 
larger variances in the alternative covariance matrices diminishes the value of 
Bayes factors to 0.866; this indicates a weak evidence for the initial specifica- 
tion relative to the alternative priors. Therefore, for each alternative prior, the 
estimated BF confirms that our initial choice is vague enough and does not 
introduce significant information in our estimation. 
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Table 3.2. Results of the sensitivity 


analysis.* 


Alternative priors 


Lb o BF 
1.00 10000 1.000 
0.00 11000 0.866 
1.00 11000 0.866 


* The alternative priors are truncated 
Normal densities; ju: prior mean; 0? prior 
variance; BF: Bayes factor. 


3.3.3 Model diagnostics 


We test the residuals for possible misspecification. The standardized residuals 
are defined by: 


ao YE 
=~ 
hy? 
fort =1,...,750, where hi is the conditional variance computed with wo.5, the 


median of the posterior sample. If the statistical assumptions in (3.1) are sat- 
isfied, these residuals should be independent and Normally distributed asymp- 
totically. 

In the upper part of Fig. 3.4, we display the residuals over time. No au- 
tocorrelation or heteroscedasticity are visually apparent. We test for autocor- 
relation using the Ljung-Box test up to lag 20 [see Ljung and Box 1978]. The 
test does not reject the null hypothesis of absence of autocorrelation at the 5% 
significance level (p-value = 0.652). This is also true for the squared residuals 
(p-value = 0.961). Therefore, the GARCH(1,1) process has been able to filter 
the heteroscedastic nature of the data. We form a quantile-quantile plot of the 
residuals against the Normal distribution in the lower graph of the figure. The 
distribution is almost Normal at its center whereas the tails are slightly fatter, 
especially the left one. The Kolmogorov-Smirnov Normality test rejects the null 
hypothesis at the 5% significance level (p-value = 0.008). The tails of the innova- 
tions’ distribution are not fat enough to fully capture the distributional nature 
of the data. This point will be addressed in Chap. 5 with the introduction of 
Student-t disturbances in the modeling. 
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Fig. 3.4. Residuals time series (upper graph) and Normal quantile-quantile plot (lower 
graph). 
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3.4 Illustrative applications 


In this section, we illustrate some probabilistic statements made possible under 
the Bayesian framework. The joint posterior sample is used to simulate nonlinear 
functions of the model parameters. 


3.4.1 Persistence 


As pointed out in Sect. 3.2, a GARCH(1,1) process for {y,} is equivalent 
to an ARMA(1,1) process for {y?} with an autoregressive coefficient (a1 + 3) 
and a moving average coefficient —3. Consequently, the autocorrelation function 
(henceforth ACF) of the squared observations comes from the standard formulae 
for the ARMA(1, 1) model. It is recursively given by: 


ps = (01 +8) X pia 
for 7 > 1, where the first order autocorrelation is: 


. ar(1 — 6? — a1 8) 
PS T= B28 


The term (a; + 3) is the degree of persistence in the autocorrelation of the 
squares which controls the intensity of the clustering in the variance process. 
With a value close to one, past shocks and past variances will have a longer im- 
pact on the future conditional variance. An autoregressive coefficient (a1+3) = 1 
corresponds to a unit root process for squared observations. 

To make inference on the persistence and ACF of the squared process, we 
simply use the posterior sample and generate (al! + BU) as well as pil for 
j = 1,...,10’000 and 7 = 1,...,20. The posterior density of the persistence 
(a, + @) is plotted in the upper part of Fig. 3.5. The histogram is left-skewed 
with a median value of 0.865 and a maximum value of 0.992. In this case, the 
integration for the variance process is not supported by the data. In the lower 
part of the figure, we display the posterior ACF with its 95% and 99% confidence 
bands together with the sample autocorrelations of the squared observations. 
Although a single observation, at lag 11, lies outside the confidence bands, the 
autocorrelation structure of the estimated GARCH(1,1) model is in line with 
the data. 
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Fig. 3.5. Posterior density of the persistence (upper graph) and posterior autocor- 
relogram (lower graph) of the squared observations. Both graphs are based on 10’000 
draws from the joint posterior sample. 
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3.4.2 Stationarity 


In the case of the GARCH(1, 1) model with Normal innovations, Bollerslev [1986, 
Thm.1, p.310] and Nelson [1990, Thm.2, p.320] gave the conditions for covari- 
ance stationarity (CSC) and strict stationarity (SSC), respectively. These con- 
ditions are given by: 


CSC =a, +@-1<0 
SSC = E[In(are? + B)] <0 


where the error term ¢; is Normally distributed. As pointed out in Sect. 3.3, 
no stationarity condition has been imposed in the M-H algorithm. The joint 
posterior sample can therefore be used to estimate the posterior density of these 
functions by generating: 


Cscll = gl) + gil —1 


K 
ssclil = ~ ym (all (ql*ly? iy gi!) 
k=1 


for 7 =1,...,10’000, where 7K] is a draw from a standard Normal distribution 
and K is set large enough (we choose AK = 17000 in our application). In Fig. 3.6, 
we present the Gaussian kernel density estimates of the posterior densities for 
CSC and SSC. As we can notice, none of these values exceed zero in our sim- 
ulation study. Thus, the estimated model is covariance stationary and strictly 
stationary. 

We conclude this section by noting that other probabilistic statements on 
interesting functions of the model parameters can be obtained using the joint 
posterior sample. For instance, the posterior median is 0.341 for the uncon- 
ditional variance and 4.54 for the unconditional kurtosis. They approximately 
correspond to the sample estimations of 0.323 and 4.63. 
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— Covariance stationarity 
-- Strict stationarity 


Fig. 3.6. Posterior densities of the covariance stationarity and strict stationarity con- 
ditions. Gaussian kernel density estimates with bandwidth selected by the “Silverman’s 
rule of thumb” criterion [see Silverman 1986, p.48]. Both kernel density estimates are 
based on 10’000 draws from the joint posterior sample. 
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Bayesian Estimation of the Linear Regression 
Model with Normal-GJR(1,1) Errors 


“Overall, these results show a greater impact on volatility 
of negative, rather than positive return shocks.” 


— Robert F. Engle and Victor K. Ng 


In this chapter, we propose the Bayesian estimation of the linear regression 
model with conditionally heteroscedastic errors. In the context of time series 
regressions, the regression part can include exogenous or lagged dependent vari- 
ables. Moreover, we extend the traditional GARCH specification of the errors 
to account for asymmetric movements between the conditional variance and the 
underlying process. The volatility tends to rise more in response to bad news 
than to good news and this phenomenon is especially true on equity markets. 
This effect was first observed by Black [1976] and is referred to as the lever- 
age effect in the financial literature. One explanation of this empirical fact is 
that negative returns increase financial leverage which extends the company’s 
risk and therefore the variance. To cope with this stylized fact, we use the GJR 
model of Glosten et al. [1993]. In this setting, the conditional variance can react 
asymmetrically depending on the sign of the past shocks due to the introduction 
of dummy variables. The appealing aspect of this model is that it encompasses 
the symmetric GARCH. In addition, the MCMC scheme presented in Sect. 3.2 
can easily be extended for this asymmetric model in order to find proposal den- 
sities for the parameters. As a first illustration, we fit the model to the S&P100 
index log-returns and compare the Bayesian and the Maximum Likelihood es- 
timates. Next, we perform a prior sensitivity analysis and test the residuals for 
misspecification. Finally, we estimate the density of the unconditional variance 
of the process. 

The plan of this chapter is as follows. We set up the model in Sect. 4.1. 
The MCMC scheme is detailed in Sect. 4.2. The empirical results are presented 
in Sect. 4.3. We conclude with some illustrations of the Bayesian approach in 
Sect. 4.4. 
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4.1 The model and the priors 


A linear regression model with Normal-GJR(1,1) errors may be written as fol- 
lows: 


w= xy+u for t=1,...,T 
1/2 
hy! 


Ut = Et 
er © N(0,1) 
hy = a9 + (nl gu, 450} + Ol pu, 1 <o})Up_1 + ORe-1 


where aj > 0, a, > 0, ag > 0 and @ S O to ensure a positive conditional 
variance and hp = yo = 0 for convenience; y; is a scalar dependent variable; x, 
is am x 1 vector of exogenous or lagged dependent variables; + is am x 1 vector 
of regression coefficients; N(0,1) is the standard Normal density. In this setting, 
the conditional variance h; is a linear function of the squared past shock and 
the past variance but contrary to the GARCH model, the conditional variance 
can react asymmetrically to past shocks depending on their signs. The leverage 
effect is present if a2 > a, so that the conditional variance is higher after a 
negative shock than a positive shock. 

In order to write the likelihood function, we define the vectors y = (y1--- yr)! 
and @ = (ao @1 @2)/ as well as the T x m matrix X whose tth row is given by 
x}. We regroup the model parameters into 7) = (7, a, 3) for notational purposes 
and define the T x T diagonal matrix: 


y= dp) = diag ({hi()} #1) 
where: 
he (th) = ao + (ail guys (qy>0} + Call guy_ (7) <0}) U1 () + Bhia() 
ut(Y) — Yt — xy : 


Then, we regroup the error terms u;(7) into the T x 1 vector u = (u1--- ur)! 


and express the likelihood function of ~w as follows: 
L(w | y,X) « (det ©)-/? exp [—-gu’n""u] . (4.2) 


We propose the following proper priors on the parameters -y, a and ( of the 
preceding model: 
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PCY) = Nin(7¥ | By, D4) 
p(a) x N3(a | Hos Ua) pa>0} 
p(B) xX N(B | us, Ma)Igg50} - 


where jeg and 4, are the hyperparameters, O is a 3 x 1 vector of zeros, Tyo} is 
the indicator function and Nq is the d-dimensional Normal density (d > 1). In 
addition, we assume prior independence between parameters 7y, a and 3 which 
yields the following joint prior: 


Then, we construct the joint posterior density via Bayes’ rule: 


P(b |y,X) « Lib | y, X)p(y) . 


4.2 Simulating the joint posterior 


As in the GARCH model of Chap. 3, the recursive nature of the variance equa- 
tion does not allow for conjugacy between the likelihood function and the joint 
prior density. Hence, we rely again on the M-H algorithm to draw samples from 
the joint posterior distribution. We draw an initial value 7)!°! = (+!°, all, gl!) 
from the joint prior and we generate iteratively J passes for wy. A single pass is 
decomposed as follows: 


als ply alge yx) 
Bl ~ p(B | 7, a%l,y,X) . 


Since no full conditional density is known analytically, we sample the parameters 
‘y, a and @ from three proposal densities. 


4.2.1 Generating vector y 


The proposal density to sample the m x 1 vector 7 is obtained by combining 
the likelihood function (4.2) and the prior density by the usual Bayes update: 


ayy | 7,0, 8,y,X) = Nin(y | Hy, Ey) 


with: 
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GS-1 = y/S-1 -1 
uy =A A+, 
fi, =8,(x/ Sy + 55"n,) 
where the T x T’ diagonal matrix © = diag({he(¥, a, 8)}#_,) and ¥ is the 


previous draw of y in the M-H sampler. A candidate y* is sampled from this 
proposal density and accepted with probability: 


min fa | y,X) a(¥ | “BY X) i} 
P(Y,0,8|y,X) a(y* | ¥,a,8,y,X)’ 


4.2.2 Generating the GJR parameters 


The proposal densities to generate the parameters a and ( are obtained in 
the same manner as in Sect. 3.2. However, since we have a regression term 
which appears in the model, we estimate the GJR parameters from the errors 
Ut = Yt — XY instead of y,. An approximate likelihood function for (@, 3) is 
then constructed from the process {u?}. Note that in the case of a GJR model 
for {uz}, we do not end up with an ARMA process for {u?} as in the GARCH 
model since we have two dummy variables which appear in the expression of the 
conditional variance. Indeed, by defining w, = u? — hi, we can transform the 
expression of the conditional variance as follows: 


hy = 9 + (ailpu,_1>0} Wr aalltu, <0} Ut 1 + Bhe_1 


> uy = a9 + (ailyu,_,>0} + C2lltu,_,<0} + B)uz_, — Bur-1 + wr 


where w; can be written as w; = (x7 —1)ht, x? denoting a Chi-squared variable 
with one degree of freedom. As in the GARCH case, the sequence {w;} is a 
Martingale Difference process where the variable w,; has a conditional mean of 
zero and a conditional variance of 2h?. 

Following the methodology of Sect. 3.2, we approximate the variable w,; by 
a variable z; which is Normally distributed with a mean of zero and a variance 
of 2h?. This leads to the following auziliary model: 


ug = A + (alu, >0} + Malta. <0} + B)UZ_1 — Bai + % - 
Then, by noting that z; is a function of (a, 3) given by: 


x(x, 8) = uf — a9 — (orl gy, _,>0} + O2lty,_. <0} + B)ut_1 


7 GBzt-1(@, 3) 
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and by defining the the Tx 1 vector z = (z,--- zr)’ as well as the T x T diagonal 
matrix: 


A= A(q, 3) = diag({2h?(a, 8)}i-1) 


we can express the approximate likelihood function of (a, 3) as follows: 
L(a, 8B | y,y,X) x (det A)~1/? exp [-42/A“tz] . 


As will be shown hereafter, the construction of the proposal densities for pa- 
rameters @ and 7 is based on this approximate likelihood function. 


Generating vector a 


We extend the recursive transformations presented in Sect. 3.2.1 to express 
the function z;(a@, 3) in (4.3) as a linear function of the 3 x 1 vector a. Since the 
conditional variance is an asymmetric function of past errors, we have to dis- 
tinguish between positive and negative shocks in the recursive transformations. 
Let us define v4, = u? for notational convenience. The appropriate recursive 
transformations are then defined as follows: 


fR=14+81, 


ko * 
ve = Ur-1ltu,_,>0} + BUE-1 


eK Lt kK 
U = Vella, <0} + Buy; 


where /§ = vj = vo* = 0. As shown in Prop. A.2 (see App. A), upon defining 
the 3x1 vector c; = (If uf vi*)’, the function z, can be expressed as 2; = v,—C}.Q@. 
Then, by considering the T x 1 vector v = (v1 --- vr)! as well as the T x 3 matrix 
C whose tth row is c}, it turns out that z = v — Ca. Therefore, we can express 
the approximate likelihood function of parameter @ as follows: 


L(a| y,B,y,X) x (det A)~'/? exp [—3(v — Ca)/A7!(v — Ca)] . 


The proposal density to sample vector @ is obtained by combining this likelihood 
function and the prior density by Bayes’ update: 


Jo( Oe | 7,a, B,y, X) x N3(a | Ha La)l{a>0} 


with: 
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where the T’ x T diagonal matrix A = diag({2h?(y, &, 3)}7_,) and @ is the 
previous draw of @ in the M-H sampler. A candidate a* is sampled from this 
proposal density and accepted with probability: 


min { ee ly, X) da(@ | y,a*, B,y,X) i} 
p(y, @,8 | y,X) da(a* | -y, &, B,y,X)’ 


Generating parameter 3 


The methodology is the same as the one presented in Sect. 3.2.2. The function 
z:((3) given by: 


24(8) = uz — a9 — (Qilgu, 50} + Callgu,_; <0} + B)ug_1 + Bz-1(8) 


is approximated by a first order Taylor expansion at point B, the previous draw 
of parameter G6 in the M-H sampler. The proposal density qg(G | e) is then 
obtained by combining the approximate likelihood of 3 and the prior density 
via Bayes’ update. A candidate (3* is generated from this density and accepted 
with probability: 


nin | HO | y,X) a6(6 | 1.0, 8%,y,X) 7 | 
p(y, 8 | ¥,X) ga(8* | y,0,8,y,%) 


Finally, we conclude this section by noting that the validity of the algo- 
rithm and the correctness of the computer code are verified by the methodology 
detailed at the end of Sect. 3.2.2. 


4.3 Empirical analysis 


We apply our Bayesian estimation method to daily observations of the Standard 
& Poors 100 (henceforth S&P100) index log-returns. The sample period is from 
January 2, 1990, to December 17, 1992, for a total of 750 observations. The 
log-returns are expressed in percent. The S&P100 index is one of the major 
benchmarks of U.S. equity performance. It consists of one hundred stocks in a 
representative sample of leading companies chosen for market size, liquidity and 
industry group. We choose this data set since it is an equity index and is therefore 
susceptible to exhibit leverage effects. Moreover, a volatility index of the S&P100 
index, the VIX, is computed by the Chicago Board of Exchange. This index 
aims to give a fear gauge to investors and can be viewed as a proxy for the 
conditional variance. This volatility index gained particular attention in recent 
years as it provides an interesting asset for hedging downside market movements. 
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The two data sets are freely available from http: //www.finance.yahoo.com. 
Note that in 2003, the formula for the VIX’s calculation has been modified and 
the underlying index has changed from the S&P100 index to the S&P500 index. 
This is of no consequence here since we consider the old VIX definition which 
is based on the S&P100 index. 

The S&P100 index log-returns are displayed in the upper part of Fig. 4.1. In 
the lower part, the VIX index is plotted for the same time period. The correlation 
between log-returns and gross rates of squared VIX, which can be viewed as a 
proxy for the variance, is -0.52, indeed suggesting the presence of the leverage 
effect. 
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(in percent) 
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VIX level 
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Fig. 4.1. S&P100 index log-returns (upper graph) and VIX level (lower graph). 


4.3.1 Model estimation 


As noted by Campbell, Lo, and MacKinlay [1996, p.104] for instance, financial 
time series such as equity indices sometimes present positive first order auto- 
correlation. This effect is stronger for high-frequency data such as intra-daily 
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or even daily time series due to market micro-structures. Based on that evi- 
dence, we estimate the model (4.1) with a constant term and an autoregressive 


parameter: 


Ye = Yo + Y1Yt-1 + Ut (4.4) 


where the errors {u;} are modeled by the Normal-GJR(1,1) process intro- 
duced in (4.1). As a prior density for the regression parameters, we choose a 
bi-dimensional Normal density. In the case of the GJR parameters, the priors 
are truncated Normal densities. Both priors have zero mean vectors and diago- 
nal covariance matrices whose variances are set to 10’000 so we do not introduce 
tight prior information in our estimation (see Sect. 4.3.2 for a formal check). 
Finally, we recall that the joint prior is constructed by assuming prior indepen- 
dence between -y, a and (3. 

We run two chains for 10’000 passes each where only positivity constraints 
for the GJR parameters are implemented in the M-H algorithm, through the 
prior densities. We test the convergence of the chains using the diagnostic test 
by Gelman and Rubin [1992]. The convergence diagnostic based on the two 
chains shows no evidence against convergence of the sampler for the last 5’000 
iterations (the values of the 97.5th percentile of the potential scale reduction 
factor ranges from 1.01 to 1.09). The one-lag autocorrelations in the chains 
range from 0.30 for parameter 71 to 0.96 for 3G. The sampling algorithm allows 
to reach acceptance rates of 66% for vector a, 77% for -y and 95% for G. From 
the overall MCMC output, we discard the first 5’000 draws and merge the two 
chains to get a final sample of length 10’000. In addition, we estimate the model 
by the usual ML technique for comparison purposes. 

The posterior statistics as well as the ML results are reported in Table 4.1. 
First, we note a difference between the Bayesian and ML point estimates for 
the GJR parameters; the ML estimates are lower for the components of vector 
qa and higher for parameter (3. In the case of vector yy, the difference between 
the Bayesian and the classical approaches is more pronounced for the compo- 
nent 7. For both components of -y, the posterior 95% confidence bands are 
centered around zero which suggests a zero expectation for y; and no need to 
model an autoregressive component for this data set. Second, we note strong 
asymmetries in the marginal posterior densities for the parameters @ and £, 
as shown in Fig. 4.2. The marginal posteriors for components of vector @ are 
right-skewed while left-skewed for parameter (3. In the ML case, the asymptotic 
Normal approximation of the parameter estimates leads to negative values for 
the left boundary of ag and a,. Caution is therefore in order when applying 
the asymptotic Normal approximation in this context. Finally, the values of the 
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inefficiency factor (IF) reported in the last column of Table 4.1 indicate that 
using 10’000 simulations is appropriate if we require that the Monte Carlo error 
in estimating the mean is smaller than 0.33% of the variation of the error due 
to the data. The larger inefficiency factor reported for parameter (3 is reflected 


in a larger autocorrelation in the simulated values. 
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Table 4.1. Estimation results for the linear regression model with Normal- 
GJR(1, 1) errors.* 


7) UMLE 7) vos Wo.0o25 Yoo75 min max IF 

Yo 0.001 0.002 0.002 -0.025 0.028 -0.048 0.047 1.92 
-0.025,0.028] (0.190) 

Va 0.003 0.004 0.004 -0.075 0.084 -0.160 0.163 1.84 
-0.076,0.082] (0.546) 

Qo 0.018 0.023 0.022 0.012 0.041 0.006 0.071 16.52 
-0.008,0.028] (0.299) 

a4 0.022 0.038 0.034 0.002 0.096 0.000 0.193 3.45 
-0.020,0.064] (0.463) 

a2 0.155 0.180 0.175 0.101 0.286 0.058 0.393 3.96 
[0.077,0.234] (0.939) 

B 0.799 0.750 0.754 0.620 0.853 0.423 0.918 33.62 
[0.713,0.885] (3.45) 


* tbytp: Maximum Likelihood estimate; ~: posterior mean; we: estimated pos- 
terior quantile at probability 6; min: minimum value; max: maximum value; IF: 
inefficiency factor (7.e., ratio of the squared numerical standard error and the 
variance of the sample mean from a hypothetical cid sampler); [e]: Maximum 
Likelihood 95% confidence interval; (e): numerical standard error (x10*). The 
posterior statistics are based on 10’000 draws from the joint posterior sample. 


Given the expression of the scedastic function in (4.1), the leverage effect can 
be measured by Aa = (a2 — a). The posterior density of Aq is displayed in 
Fig. 4.3. The mean value is 0.142 and the median is 0.139, indicating a stronger 
impact of negative shocks to the conditional variance, as expected. Using the 
posterior sample, we can also estimate the probability of the presence of the 
leverage effect, i.e., P(Aa > 0 | y,X). The estimation gives a probability value 
of 0.998 with a 95% confidence band of [0.9976,0.9996]. Therefore, the data 


strongly support the presence of the leverage effect for the S&P100 index. 
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Parameter op 


T T T T T T T 
0.01 0.02 0.03 0.04 0.05 0.06 0.07 


Parameter o. 


Fig. 4.2. Marginal posterior densities of the GJR(1,1) parameters; upper graph: pa- 
rameter ao; lower graph: parameter a1. The histograms are based on 10’000 draws 
from the joint posterior sample. 
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Parameter o2 
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Fig. 4.2. (cont.) Marginal posterior densities of the GJR(1,1) parameters; upper 
graph: parameter a2; lower graph: parameter 3. The histograms are based on 10’000 
draws from the joint posterior sample. 


4.3 Empirical analysis 51 


Aa 


1500 + 


1000 + 


500 + 
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Fig. 4.3. Posterior density of the leverage effect parameter Aa = (a2 — a1). The 
vertical line is set at Aa = 0. The histogram is based on 10’000 draws from the joint 
posterior sample. 
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4.3.2 Sensitivity analysis 


As in Sect. 3.3.2, we test the sensitivity of our posterior results with respect 
to the choice of the prior density. We consider three alternative priors by either 
modifying the mean and/or increasing the variance relative to our initial prior. 
Formally, the alternative prior densities can be expressed as follows: 


P(Y) « No(y | pt2, 072) 
p(a) « N3(@ | 103, 0713)l paso} 
p(B) x N(B | 1,07)L peso} 


where tg is a d x 1 vector of ones, Ig is a d x d identity matrix, uz is the prior 
mean and o? the prior variance. 

The sensitivity results are reported in Table 4.2; the first two columns give 
the hyperparameters’ values of the alternative priors while the last column report 
the estimated Bayes factors. In all cases, the Bayes factors belong to the interval 
(0.3125, 1] which implies a weak evidence in favor of our initial specification 
relative to the alternative priors. This indicates that our initial prior is vague 
enough and does not introduce significant information in our estimation. 


Table 4.2. Results of the sensitivity 


analysis. * 


Alternative priors 


bb o BF 
1.00 10°000 0.999 
0.00 11000 0.751 
1.00 11000 0.751 


* The alternative priors are (truncated) 
Normal densities; 41 prior mean; a? prior 
variance; BF: Bayes factor. 


4.3.3 Model diagnostics 


We test the residuals for possible misspecification. The standardized residuals 
are defined by: 
je 
aA . Yt XY 
= a5 
he 
for t = 1,...,750, where ¥ is the posterior median of vector -y and hi is the 
conditional variance computed with the median of the posterior sample. If the 
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statistical assumptions in (4.1) are satisfied, these residuals should be indepen- 
dent and Normally distributed asymptotically. 

We test the residuals for autocorrelation using the Ljung-Box test up to lag 
20 [see Ljung and Box 1978]. The test does not reject the null hypothesis of 
the absence of autocorrelation at the 5% significance level (p-value = 0.365). 
This is also true for the squared residuals (p-value = 0.780). The Kolmogoroy- 
Smirnov Normality test does not reject the null hypothesis at the 5% significance 
level with a p-value of 0.0514. On the contrary, the Jarque-Bera Normality test 
strongly rejects the null. Hence, while the model is able to filter the hetero- 
scedasticity, it is not flexible enough to account for the high kurtosis of the 
residuals. This point will be addressed in Chap. 5 with the introduction of 
Student-t errors in the modeling. 


4.4 Illustrative applications 


We end this chapter with the estimation of the unconditional variance of the 
underlying process. Under model specification (4.4), the process is covariance 
stationary if the following conditions are satisfied: 

CSc; = 77 -1<0 

CSC2 = (@+ 8) -1<0 


where we define @ = ato for notational purposes. If both conditions are 
satisfied, the unconditional variance h, exists and is given by: 


——— 
o—CSCy x CSC, * 


The joint posterior sample can be used to estimate the posterior density of these 
functions by generating: 


j] j]\ 2 
osc}! = (yfl)* -1 


and then: 


pl] = 
* — escll x csc! 


for 7 = 1,...,10’000. In our simulation study, none of the values CSC, and CSC2 
exceed zero, thus indicating that the process is covariance stationary and that 
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the unconditional variance exists. The posterior density of the unconditional 
variance is displayed in Fig. 4.4 together with the ML asymptotic Normal 
approximation. The posterior mean and the posterior median are respectively 
0.169 and 0.1653. The value of the unconditional variance computed from the 
ML point estimates is slightly lower with a value of 0.1608. The 95% confidence 
band given by the Bayesian approach is [0.1373,0.2197]. In the classical ap- 
proach, the confidence band computed via the delta method is [0.0725,0.2491]. 
In this case, the asymptotic Normal approximation highly overestimates the 
size of the confidence band, especially the left part of the interval. As shown 
in Fig. 4.4, the asymptotic approximation is flat and symmetric whereas the 
posterior density is more peaked and exhibits a positive skewness (the skewness 
is 2.01 and significantly different from zero). 


hy 
- - Delta approximation 

20 4 Posterior density 
15 7 

10 

Ao |S 
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Fig. 4.4. Posterior density of the unconditional variance and asymptotic Normal ap- 
proximation. The histogram is based on 10’000 draws from the joint posterior sample. 
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Bayesian Estimation of the Linear Regression 
Model with Student-t-GJR(1, 1) Errors 


“This development (i.e., the Student-t distribution) 
permits a distinction between conditional 
heteroskedasticity and a_ conditional leptokurtic 
distribution, either of which could account for the 
observed unconditional kurtosis in the data.” 


— Tim Bollerslev 


In this chapter, we extend the linear regression model with conditionally hetero- 
scedastic errors. The conditional variance is again described by the GJR process 
introduced in Chap. 4. However, in the new specification, the errors are no 
longer Normally distributed but follow a Student-t distribution. Therefore, the 
model incorporates the possibility of heavy-tailed disturbances. Indeed, while 
the Normal distribution is used routinely, it has been widely recognized that 
financial markets exhibit significant non-Normalities, in particular asset returns 
exhibit heavy tails. A distribution with fat tails makes extreme outcomes such 
as crashes relatively more likely than does a Normal distribution which assigns 
virtually zero probability to events that are greater than three standard devi- 
ations. Since one of the objectives of financial risk management models is to 
measure severe losses, 7.e., events appearing in the tails of the distribution, this 
is a serious shortcoming and the alternative of the Student-t distribution is a 
parsimonious way to incorporate fat tails in the modeling. In the Bayesian ap- 
proach, the heavy-tails effect is created by the introduction of latent variables 
in the variance process as proposed by Geweke [1993]; this approach allows the 
Bayesian estimation of the degrees of freedom parameter in a convenient man- 
ner. As a first illustration, we fit the model to the S&P100 index log-returns 
and compare the Bayesian and the Maximum Likelihood estimations. Next, we 
perform a prior analysis and test the residuals for misspecification. Finally, we 
estimate the conditional and unconditional kurtosis of the underlying time se- 
ries. 
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The plan of this chapter is as follows. We set up the model in Sect. 5.1. 
The MCMC scheme is detailed in Sect. 5.2. The empirical results are presented 
in Sect. 5.3. We conclude with some illustrations of the Bayesian approach in 
Sect. 5.4. 


5.1 The model and the priors 


A linear regression model with Student-t-GJR(1,1) errors may be written as 
follows: 


Y= xiy+u, for t=1,...,T 


Ut = e(ohg yt? 
ez * S(0,1,v) (5.1) 


_yv-2 
OF 


Vv 


ht = an + (a1T¢u,_4>0} + Ol pu, 1 <0} )UE-1 + Bhi_-1 


where ag > 0, ay > 0, ag > 0, 6 S 0, v > 2 and ho = yo = O for convenience; 
yz is a scalar dependent variable; x; is a m x 1 vector of exogenous or lagged 
dependent variables; + is a m x 1 vector of regression coefficients; S(0,1,v) is 
the standard Student-t density with v degrees of freedom, 7.e., its variance is 
>4- From model specification (5.1) we note that @ is a scaling factor which 
normalizes the variance of the Student-t density so that h; is the variance of y; 
given by the GJR scedastic function. The restriction on the degrees of freedom 
parameter ensures the conditional variance to be finite and the restrictions on 
the GJR parameters guarantee its positivity. 

In order to write the likelihood function, we define the vectors y = (y1--- yr)’ 
and @ = (ag a1 G2)’ as well as the T x m matrix X of observations whose 
tth row is x}. For notational purposes, we regroup the model parameters into 


w = (y¥,a,G,v). In addition, we define the T x T diagonal matrix: 


D = D(p) = diag ({ohe(y, a, 8) }f1) 
where: 


hi(y, a; 8) = a0 + (il gu,_1(7) 0} + Olleuy_s (4) <0})UF_-1() 
+ Bhi-s(7, a, 8) (5.2) 
ur(Y) = Ye — XY - 


5.1 The model and the priors ov 


Then, we regroup the error terms u;(7) into the T x 1 vector u = (u1--- ur)! 


and express the likelihood function of ~ as follows: 
_ +1) 


a . -1/2 “ ut : 
ae (det ©) TT(+ Se.) . (5.3) 


L£s(v | y, X) x 


The notation £s5 emphasizes the fact that the likelihood function is constructed 
from the Student-t density. 

While the likelihood function (5.3) can be used in classical inference, it is 
not convenient in the Bayesian framework. It is indeed difficult to find proposal 
densities for the parameters, especially if we aim to sample the degrees of free- 
dom vy as well. To overcome this problem, we express the innovations process 
{uz} in another specification as proposed by Geweke [1993]. In this new setting, 
the variable u; is expressed as follows: 


Ut = er(woht)/? for t=1,...,T7 
a 
e, ~ N(0,1) (5.4) 
tid VV 
#15(53) 
mete ai 5 
Hence, the error term u;, conditional on w;, follows a Normal distribution with 
a mean of zero and a variance of w;oh;; the scedastic function oh; is multiplied 


by a latent variable @;, which is Inverted Gamma distributed. The degrees of 
freedom parameter v characterizes the density of a, as follows: 


vo) =($)" [tr ($)] ar ten |-4] . (5.5) 


Let us now regroup the latent variables into the Tx1 vector w = (w1-:: mr)’ 
and define the augmented set of parameters O = (w, @). Upon defining the TxT 
diagonal matrix: 


¥ = E(O) = diag({aohi(y, a, 8)}1) 


where h; is given in expression (5.2), we can express the likelihood function of 
@ as follows: 


L(O | y,X) x (det ©)~1/? exp [—4u/E“tu] . (5.6) 


This specification is equivalent to (5.3). However, in that case, the Bayesian 
estimation can be handled for all parameters in a convenient manner. 
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We propose the following proper priors on the parameters yy, a, 3 of the 
preceding model: 


py) = Nin (¥ | By, Dy) 
p(@) x N3(a | He, Ua) ta>o} 
p(B) « N(B | us, &a)I paso} 


where we recall that . and X, are the hyperparameters, I;,} is the indicator 
function, 0 is a 3x 1 vector of zeros and Nq is the d-dimensional Normal density 
(d> 1). 

The prior density of vector @ conditional on v is found by noting that the 
components @ are independent and identically distributed from (5.5), which 
yields: 


xoid=()" OI" (I=) o[-te2]. 


We follow Deschamps [2006] in the choice of the prior density on the degrees 


Tv 
2 


of freedom parameter. The density is a translated Exponential with parameters 
A>Oand 6 > 2: 
p(v) = Aexp | — A(v — 4) |Ipvs6} - 


For large values of A, the mass of the prior is concentrated in the neighborhood 
of 6 and a constraint on the degrees of freedom can be imposed in this manner. 
The Normality of the errors is obtained when 6 becomes large. As pointed out 
by Deschamps [2006], this prior density is useful for two reasons. First, it is 
potentially important, for numerical reasons, to bound the degrees of freedom 
parameter away from two to avoid explosion of the conditional variance. Second, 
we can approximate the Normality of the errors while maintaining a reasonably 
tight prior which can improve the convergence of the MCMC sampler. 

Finally, we assume prior independence between y, a, 3 and (w,v) which 
yields the following joint prior: 


P(O) = p(y)p(@)p(2)p(@ | v)p(v) 


and, by combining the likelihood function (5.6) and the joint prior, we construct 
the posterior density via Bayes’ rule: 


P(O |y,X) x LO |y, X)p() . 
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5.2 Simulating the joint posterior 


Once again, we rely on the M-H algorithm to draw samples from the joint 
posterior distribution. We draw an initial value: 


1 = (719), ql, gl), ao!9l, plo) 


from the joint prior and we generate iteratively J passes for O. A single pass is 
decomposed as follows: 


Al opp (elt, BU eg 11 ltl xy 
spe al GON sO yO ay xy 


( 

Tm pl 

lw p(p | yl, alfl, oli), ply, x) 
( 
( 


aL. 3s. Se 


Iw p(w | 7!) al), gl, DU-H y, x) 


vil w p(y oll) . 


Only vector @ can be simulated from a known expression. Draws of parameters 
‘, a and @ are made using a method similar to the one presented in Sect. 4.2. 
Sampling parameter v is more technical and relies on an optimized rejection 
technique. 


5.2.1 Generating vector y 


The proposal density to sample the m x 1 vector y is obtained by combining 
the likelihood function (5.3) and the prior density by Bayes’ update: 


dy(Y | Y,a,B,0,v,y,X) = Ninl¥ | Hy, Ley) 


with: 


nN 


pire Oo ie ere Dink 
fp, = 8,(X/ Sy +054n,) 
where the T x T diagonal matrix © = diag ({w0h1(¥, a, 8)}#_1), ¥ is the pre- 


vious draw of 7 in the M-H sampler and @ = as A candidate -y* is sampled 
from this proposal density and accepted with probability: 


min | ee | y, X) a(¥ | y*,a,8,@,V,y,X) i} 
p(y, a, 8, w,v | y,X) aly | VO Byars y 
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5.2.2 Generating the GJR parameters 


The methodology is similar to the one exposed in Sect. 4.2.2. Let us define: 


2 
_ Uj; 


Wt= ——h 
Tt 


where Tt = wo for convenience. From there, we can transform the expression 
of the conditional variance as follows: 


hy = A + (ailpu,_1>0} pe orl tu,» <o}) UE 1 + Bhy-1 


2 
U 
& a3 2 
& = = a + (Ail fuss >0} + O2lfur_s <op)Ue-1 
t 
2; 
UW 
t-1 
+ 3 Buwr-1 + wr 
Tt-1 


VU; = a9 + (Ail fu, 50} + O2ltu,_,<0}) 14-1 


+ Bur_1 — Bwr_1 + we 


2 
where we define vu, = - for notational purposes. Moreover, we note that the 
variable w; can also be expressed as follows: 


2 2 
w= ton (4 ) me = (=I) 


where yj denotes a Chi-squared variable with one degree of freedom. The last 
equality results from the expression for u, in (5.4). Therefore, the variable w; 
has a conditional mean of zero and a conditional variance of 2h?. In addition, 
we note that the sequence {w;} is again a Martingale Difference process. 

Approximating the variable w; by a variable z; which is Normally distributed 
with a mean of zero and a variance of 2h? yields the following auxiliary model 
for v¢: 


Uz = G9 + (ArT 4,150} + Cal guy. <o})TH-1Vt—-1 + BUi-1 


— But %. 
Then, by noting that z; is a function of (a@, 3) given by: 


ze(a@, 8) = ve — a0 — [(ilpu,_y>0} + Calta, <o})T#-1 + Bl UL-1 


5.7 
+ Bz-1(@, 8) a 


and by defining the T x 1 vector z = (z,--- zr)’ as well as the T x T diagonal 


matrix: 
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A = A(a, 8) = diag({2h; (a, 8)}1=1) 
we can approximate the likelihood function of (a, 3) as follows: 
L(a, 8 | y,o,y,X) x (det A)~'/? exp [—42/A71z] . (5.8) 


As will be shown hereafter, the construction of the proposal densities for pa- 
rameters @ and @ is based on this approximate likelihood function. 


Generating vector a 


Our aim is to express the function z;(@, 3) in (5.7) as a linear function of the 
3x1 vector a. To that aim, let us define the following recursive transformations: 


=1+ 6h, 


ee D * 
ve = Ut_Tfu,_1>0} + B%G_1 


Kk + D2 2K 
Ub = Usa. <0} + Buy, 


where /§ = vj = vo* = 0. As shown in Prop. A.3 (see App. A), upon defining 
the 3 x 1 vector c, = (lf uf u;*)’, it turns out that the function z, can be 
expressed as 2; = v; — c}a@. Hence, by defining the T x 1 vectors z = (21 --- zr)! 


/ 


and v = (v,--- vr)! as well as the T x 3 matrix C' whose tth row is c}, we get 


z= v-— Ca. Therefore, we can express the approximate likelihood function of 


parameter a@ as follows: 
L(a| +, 8,~,v,y,X) x (det A)~'/? exp [-4(v — Ca)'A“'(v—Ca)] . 


The proposal density to sample vector @ is obtained by combining this likelihood 
function and the prior density by Bayes’ update: 


n~ 


da(@ | an a, 3, a, v,y,X) x N3(a | Bas Ya)l peso} 
with: 


pie Oi a te ate 
fi = De(C’A*v + Dy Mg) 
where the T x T diagonal matrix A = diag({2h? (7, @, 3)}#_,) and @ is the 


previous draw of @ in the M-H sampler. A candidate a* is sampled from this 
proposal density and accepted with probability: 
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min | ee | y,X) da( | y,a*,3,7,V,y,X) i} 
p(a, 7, B,,V|y,X) qal(a* | 7,a,8,@,v,y,X)’ 


Generating parameter 3 


Contrary to the parameter a, we cannot express the function z,(a, 3) in (5.7) 
as a linear function of 3. To bypass this problem, we approximate the function 
z1(3) by a first order Taylor expansion at point (: 


2u(8) ~ (6) + St 


aB| X{8- 9) 


B=6 


where B is the previous draw of 3 in the M-H sampler. From there, we define 
the following: 


ry = 24(8) + BVe » Ve=z- as 


where the terms V; can be computed by the following recursion: 
Vi = vy — %4-1(8) + BVi-a 


with Vo = 0. This recursion is simply obtained by differentiating (5.7) with 
respect to 3. Then, we regroup these terms into the T x 1 vectors r = (r1--- rr)! 
and V = (V,--:Vr)’ and we approximate the term within the exponential 
in (5.8) by z ~ r—PV. This yields the following approximate likelihood function 
for parameter (3: 


L(B | Ya, w, v,y,X) x (det Ao exp [—3@ > BV) Ar _ Bv)| . 


This likelihood function is combined with the prior density by Bayes’ update to 
construct the proposal qg( | e). A candidate (* is sampled from this proposal 
density and accepted with probability: 


in | Bea 9.8 qe(B | y,0, 3*, @,v,y,X) 7 
ply; a; B, 27, v | y,%) ga(0* | 7,0, 0,07,9,%) 


5.2.3 Generating vector w 


The components of @ are independent a posteriori and the full conditional 
posterior of a; is obtained as follows: 
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p(w | , a, 3,v,y, X) x Le | y,X)p(w | v) 


=o) bt 
x wy; exp a (5.9) 
with: 
5 1 flue xn? 
t= T 
2 oht 
where we recall that hy = hi(y,a@, 3) and 9 = ae Expression (5.9) is the kernel 


of an Inverted Gamma density with parameters 4+ and by. 
2 


5.2.4 Generating parameter v 


Draws from p(v | a) are made by optimized rejection sampling from a translated 
Exponential source density. The target density is: 


pv | w) x p(@ | v)p(v) 
(2) [Geol ettesn 


with: 


T 
o= 5d (neta) +A. 


Following Deschamps [2006], we sample a candidate v* from a translated Expo- 
nential source density: 


g(v; B, 6) = fiexp [ — f(y — 6) |Tsey 


where jt maximizes the acceptance probability. The choice of jf is found by 


F L+po)\ | 1+ pd _ 


diInI(z) 
dz 


solving: 


for uw, where U(z) = is the Digamma function. The candidate v* is 


accepted with probability: 
; k(v*) 
p* SS Fe OE Pee ee 5.10 
5H 90s, 8) oy) 
where k(v) is the kernel of the target density: 


Ete 
2 


Ko (5) PG) Teoon 
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and s(u,6) is given by: 


cone ('58) bend] 


Cea elk a a) 
= T fu exp | 1 — ———| 
2p 2 Ll 


Substituting for k(v*), s(fi,6) and g(v*; ji, 6) in expression (5.10) yields: 


T ( ites a Tus =T(+f6) 
“=lrey| (2) Gar) * 
Pp — us —— 
D(4) 2 2p 


x exp G 5)(fi — ¢) ES i}. 


To end this section, we note that a slight modification of Geweke [1993] 
allows to generate draws from a Student-t distribution with conditional variance 
hy, without requiring the introduction of a scaling parameter 9 = v2 This is 
done by replacing the specification for the latent variable a; in (5.4) by: 


i y v—-2 


The use of this new specification requires some modifications of the efficient 
rejection scheme. We refer the reader to App. B for further details. 

Finally, we note that the validity of the algorithm and the correctness of 
the computer code are verified by the methodology detailed at the end of 
Sect. 3.2.2. 


5.3 Empirical analysis 


To illustrate our Bayesian estimation method, we fit the Student-t-GJR(1, 1) 
model to the data set used in the empirical analysis of Chap. 4. Based on 
previous results, we do not include the regression part in the current estimation. 


5.3.1 Model estimation 


As prior densities for the GJR parameters, we choose truncated Normal densities 
with zero mean vectors and diagonal covariance matrices whose variances are 
set to 10’000. For the prior on the degrees of freedom parameter, we set the 
hyperparameters to A = 0.01 and 6 = 2; the prior mean is therefore 102 and the 
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prior variance 10’000. Note that the value of the hyperparameter 6 is determined 
so that the conditional variance exists. Moreover, we recall that the joint prior 
is constructed by assuming prior independence between a, 3 and (@,v). 

We run two chains for 10’000 passes each and control the convergence of the 
sampler using the diagnostic test by Gelman and Rubin [1992]. The convergence 
diagnostic shows no evidence against convergence for the last 5’000 iterations 
(the value of the 97.5th percentile of the potential scale reduction factor ranges 
from 1.001 to 1.1). The one-lag autocorrelations in the chains range from 0.59 
for parameter a, to 0.97 for parameter v. The acceptance rate is 73% for vector 
a and 95% for parameter 3. The optimized rejection technique allows to draw 
a new value of v at each pass in the M-H algorithm. From the overall MCMC 
output, we discard the first 5’000 draws and merge the two chains to get a final 
sample of length 10’000. 

The posterior statistics as well as the ML results are reported in Table 5.1. 
First, we note that results for the GJR parameters are close to the results 
of Table 4.1 (see p.48). The posterior means of the parameters are slightly 
higher in the Student-t case (except for parameter ao) as well as the numerical 
standard errors. Second, the marginal posterior densities (not shown) are still 
clearly skewed and the 95% confidence band of the parameters obtained through 
the asymptotic Normal approximation leads to a negative left boundary for 
component a;. The ML point estimate for the degrees of freedom parameter 
is 9.9 while the posterior mean is 7.15 and the posterior median is 7.11. This 
low value indicates a departure from Normality for the errors. In addition, the 
95% confidence band given by the ML approach is much wider than the one 
estimated via the Bayesian approach. The left boundary is 0.14 which rejects the 
existence of the conditional variance. In the case of the Bayesian estimation, the 
minimum value for the degrees of freedom is 3.84, which supports the existence 
of the conditional variance. The values of the inefficiency factor (IF) range from 
3.65 for parameter a, to 111.98 for parameter v, indicating that in the worst 
case, the numerical errors represent about 1.12% of the variation of the errors 
due to the data. 

In Fig. 5.1, we present a comparison between the classical and the Bayesian 
approaches. The upper graphs show a scatter plot of the draws from the asymp- 
totic Normal approximation of the model parameters; the Normal density is 
centered at the ML estimates que and its covariance matrix is estimated as 
the inverse of the Hessian matrix evaluated at Wir. The lower graphs present a 
scatter plot of draws from the joint posterior sample. In both cases, the number 
of draws is 10’000. The first part of the figure depicts the draws for (ao, 3). By 
comparing the ML and Bayesian outputs, we can notice a clear difference in the 
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Table 5.1. Estimation results for the Student-t-GJR(1, 1) model.* 


w YMLE wo wo. Wo.025 Wo.o75 min max IF 

Qo 0.012 0.018 0.017 0.008 0.036 0.003 0.060 18.15 
[0.002,0.021] (0.314) 

a1 0.026 0.041 0.037 0.008 0.107 0.000 0.194 3.65 
[-0.015,0.067} (0.516) 

a2 0.153 0.203 0.194 0.105 0.350 0.055 0.533 6.93 
[0.063,0.242] (1.641) 

B 0.834 0.776 0.785 0.634 0.870 0.408 0.908 54.98 
(0.740,0.929] (4.636) 

Vv 9.90 7.549 7.11 4.54 13.60 3.84 18.85 111.98 


[0.14,19.65] (236.50) 


* Uwe: Maximum Likelihood estimate; 7: posterior mean; we: estimated posterior 
quantile at probability ¢; min: minimum value; max: maximum value; IF: ineff- 
ciency factor (i.e., ratio of the squared numerical standard error and the variance of 
the sample mean from a hypothetical iid sampler); [e]: Maximum Likelihood 95% 
confidence interval; (e): numerical standard error (x10?). The posterior statistics 
are based on 10’000 draws from the joint posterior sample. 


tails of the joint density. Indeed, the Bayesian posterior exhibits larger values 
for parameter ap together with lower values for parameter (3. In addition, when 
drawing a vertical line at ag = 0, we note that some draws are negative with 
the asymptotic Normal approximation. In the posterior sample, the draws are 
positive as this is required by the prior density. In the second part of Fig. 5.1, 
the two graphs show the draws for (a2, 3). For these parameters, the posterior 
sample exhibits a clear departure from the ellipsoid shape obtained with the 
Normal approximation. 

In Fig. 5.2, we display the prior and the posterior densities of the degrees 
of freedom parameter. While the prior density is almost flat (we recall that 
the hyperparameters are set to \ = 0.01 and 6 = 2 so that the prior mean is 
102 and the prior variance 10’000), the shape of the posterior density is peaked 
and concentrated around its mean value. In addition, the density is significantly 
right-skewed. 
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Fig. 5.1. Comparison between the ML (upper graph) and the Bayesian (lower graph) 
approaches. For both graphs, the number of draws is 10000. 
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Normal approximation 
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Fig. 5.1. (cont.) Comparison between the ML (upper graph) and the Bayesian (lower 
graph) approaches. For both graphs, the number of draws is 10’000. 
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Fig. 5.2. Prior and posterior densities of the degrees of freedom parameter. The 
vertical line is centered at v = 4, the value required for the conditional kurtosis of the 
errors to exist. The histogram is based on 10’000 draws from the posterior sample. 
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5.3.2 Sensitivity analysis 


As in previous chapters, we test the robustness of our results with respect to 
the choice of the prior density. To that aim, we consider the same alternative 


priors of Sect. 4.3.2 for parameters a and (3: 


p(a) « N3(@ | j13, 0°13) I paso} 
p(B) XN (B| p,07)L peso} 


where we recall that e3 is a 3 x 1 vector of ones, [3 is a 3 x 3 identity matrix, fu is 
the prior mean and o? the prior variance. For the alternative prior on the degrees 
of freedom parameter, we consider a translated Exponential with A = 0.008 and 
6 = 2, which implies a prior mean of 127 and a prior variance of 15’625. 

The results of Table 5.2 indicate that the prior on the degrees of freedom 
has the largest impact on Bayes factors. Moreover, in all cases we conclude to a 
weak evidence in favor of the initial specification relative to the alternative priors 
since the Bayes factors belong to the interval [0.3125, 1]. This indicates that our 
initial prior is vague enough and does not introduce significant information in 


our estimation. 


Table 5.2. Results of the sensitivity analysis.* 
Alternative priors 
m o r BF 
1.00 10’000 0.01 1.000 
0.00 11’000 0.01 0.826 
0.00 10’000 0.008 0.809 
1.00 11’000 0.008 0.668 


* The alternative priors on the parameters a and (3 
are truncated Normal densities; j prior mean; 0? prior 
variance; The alternative prior on the parameter v is 
a translated Exponential with hyperparameters \ and 
6 = 2; BF: Bayes factor. 


5.3.3 Model diagnostics 


We test the standardized residuals for possible model misspecification. The 
Ljung-Box test does not reject the absence of autocorrelation in the residuals 
at the 5% significance level (p-value = 0.4727). This is also true for the squared 
residuals (p-value = 0.8724). The Kolmogorov-Smirnov Normality test slightly 
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rejects the Normality assumption at the 5% significance level with a p-value 
of 0.0437. However, when comparing the standardized residuals to a Student-t 
distribution whose degrees of freedom parameter is set to the posterior median 
Vv = 7.11, the Kolmogorov-Smirnov empirical distribution test does not reject 
the null hypothesis at the 5% significance level (p-value = 0.4296). Hence, the 
model accounts for the conditional heteroscedasticity and for the high kurtosis 
in the residuals. 


5.4 Illustrative applications 


We end this chapter by illustrating some probabilistic statements made on the 
conditional and unconditional kurtosis of the underlying process. Under speci- 
fication (5.1), the conditional kurtosis «- is defined as follows: 


. 3(v — 2) 


_— 
y—A4 


provided that v > 4. Using the joint posterior sample, we estimate the posterior 
probability of the existence for the conditional kurtosis, P(v > 4 | y, X), to 0.999. 
Therefore, the existence is clearly supported by the data. The posterior mean 
of the kurtosis is 6.82 and the 95% confidence interval is [3.72,13.8], indicating 
heavier tails than for the Normal distribution. 

Finally, we extend the analysis to the unconditional kurtosis of the process. 
Let us define @ = “1% for notational convenience. As demonstrated by He 
and Teréaisvirta [1999], the expression of the unconditional kurtosis K, is given 
by: 


Ke(L+@+ 8)(1- a — 8) 
1 4A) — 26(a+ f) 


Ky = 
provided that «- is finite and: 


Ke (aj + 3) 
2 


The posterior probability of the latter condition is 0.007, meaning that there is 
a 0.7% chance that the unconditional kurtosis exists. 
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Value at Risk and Decision Theory 


“Density forecasting is fast becoming an important tool 
for decision makers in situations where loss functions 
are asymmetric and forecast errors follow non-Gaussian 
distributions.” 


— Allan Timmermann 


6.1 Introduction 


Since the Group of Thirty report in 1996, the Value at Risk (henceforth VaR) 
has become the corner-stone in any risk management framework and is essential 
in allocating capital as a cushion for market risk exposures. This measure gives, 
for a given time horizon and a given confidence level ¢, the portfolio’s loss that 
is expected to be exceeded with probability ¢° = (1 — ¢). The VaR is in many 
aspects an attractive measure of risk, being relatively easy to implement and 
easy to explain to non-expert audiences. While primarily designed for market 
risk exposures, the VaR methodology now underpins the credit and operational 
risk recommendations. From the internal models approach endorsed by the Basel 
Committee on Banking and Supervision of Banks for Internal Settlement [see 
Basel Committee on Banking Supervision 1995] and later adopted by US bank 
regulators, banks are allowed to use their own models to estimate the VaR and 
keep aside regulatory capital. 

From a statistical viewpoint, the VaR is nothing else than a given percentile 
of the profit and loss (henceforth P&L) distribution over a fixed horizon. To 
be acceptable by regulators, the confidence level must be 99% and the holding 
period must be two weeks (i.e., ten trading days). This is motivated by the fear 
of a liquidity crisis where a financial institution might not be able to liquidate 
its holdings for ten days straights. However, market participants consider the 
99% confidence level and the two weeks horizon to be too conservative. As an 
additional tool for internal risk controlling, both the holding period and the 
confidence level can be selected to fit the needs of analysts; in practice, it is 
common to limit the confidence level to 95% and the holding period to one day. 
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Evidently VaR can only be constructed by statistical methods. But in most 
applications, the true P&L distribution is not known and VaR can only be 
estimated from sample data. The underlying assumption of all the VaR estima- 
tion methods is that the risk associated with a particular portfolio for a fixed 
time horizon is encapsulated within the P&L distribution. If this distribution is 
known, the VaR can be obtained directly by reading the appropriate percentile 
value from this distribution. If the P&L is unknown, it must be estimated. As 
noted by McNeil and Frey [2000, p.272]: 


(...) “the existing approaches for estimating the P&L distribution can 
be divided into three groups: the non-parametric historical simulation 
method; fully parametric methods based on an econometric model for 
volatility dynamics and the assumption of the conditional distribution, 
e.g., GARCH models; and finally methods based on extreme value the- 


ory.” 


We focus on the second approach in the current application. 

Within the fully parametric literature, many papers either attempt to fore- 
cast the VaR at different time horizons or use the VaR to assess the forecasting 
performance of a particular model. In both cases, the methodology is the same. 
First, a statistical model which describes the P&L dynamics is determined. The 
model parameters are estimated by the Maximum Likelihood technique for a 
given estimation window. Then, based on these estimations, a VaR point fore- 
cast is determined for a given horizon. The procedure is repeated again over a 
testing window by rolling the estimation window; in this manner, we obtain a 
time series of VaR. forecasts. Then, the model is backtested, i.e., the predictions 
are compared with the realized P&Ls. Often, a statistical test is used to assess 
the performance of the model, i.e., to determine whether the model captures 
the true VaR |see, e.g., Christoffersen 1998, Kupiec 1995). 

While this methodology is accepted by academics and is widely implemented 
in practice, we note that few empirical studies account for the uncertainty in 
the VaR predictions. Nevertheless, this issue is important in a risk management 
framework where some measure of the forecasts’ accuracy is also needed; assess- 
ing the uncertainty of the VaR will allow the portfolio managers to make more 
informed decisions when dictating a portfolio re-balance, for instance. We may 
distinguish two sources of uncertainty which can influence the VaR accuracy: 


e The parameter uncertainty within the context of a given model; 


e The model uncertainty; via a probability function defined on a class 
of M possibly non-nested models M; (¢ = 1,...,M). 
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The former source of uncertainty, also referred to as estimation risk, is straight- 
forwardly handled in Bayesian inference since the complete characterization of 
the parameter uncertainty is contained in the joint posterior. The latter, known 
as model risk, is a natural concept in the Bayesian framework. However, in 
practice, its estimation involves many difficulties. In effect, the methodology re- 
quires the estimation of the model likelihood p(y | M;) which can be difficult 
to estimate. Several estimation methods have been proposed but their cost is 
far from negligible. In addition, the method may not work properly and can be 
sensitive to the choice of the prior density. For these reasons, we concentrate 
our attention on the estimation risk where the parameter uncertainty is used to 
determine the VaR density instead of a single VaR point estimate. 

Some approaches have been proposed to quantify the VaR uncertainty when 
the P&L dynamics is described by GARCH models. Basically, these techniques 
rely on the bootstrap methodology as in Christoffersen and Gongalves [2004] or 
on some asymptotic justifications as in Bams, Lehnert, and Wolff [2005]. The 
former approach is computationally very demanding since at each step in the 
procedure, a GARCH model is fitted to the bootstrapped data. While technically 
more convenient, the latter approach relies on an asymptotic approximation of 
the distribution of the parameter estimates. The Bayesian approach gives a nat- 
ural answer to these problems, as noted by Miazhynskaia and Aussenegg [2006]. 
As will be shown hereafter, the s-day ahead VaR (s > 1) can be expressed as a 
function of the model parameters; hence, for each parameter in the joint poste- 
rior sample, we can obtain a VaR point forecast. By repeating the estimation for 
each draw in the posterior sample, we obtain an estimation for the VaR density 
itself. When the forecast horizon is one day, the Bayesian approach gives the ex- 
act VaR density. For forecasting horizons larger than one day, an approximation 
based on the first four moments of the future P&L density can be obtained. 

In the Bayesian framework, we can either integrate out the parameter un- 
certainty or choose a Bayes point estimate within the VaR density. The former 
case is achieved by simulating from the predictive density and estimating the 
VaR from empirical percentiles. The latter case yields an interesting problem 
of decision theory: the choice of a Bayes point estimate which is optimal given 
a particular loss function. In decision theory, the common practice is to use a 
symmetric squared error loss function. While this loss function is appropriate in 
many statistical applications, it may however not be flexible enough for finan- 
cial purposes, where over- and underestimation may have different consequences. 
Hence a flexible asymmetric loss function is required. 

The contributions of this chapter to the existing literature are as follows. 
First, we provide a manner to approximate the multi-day ahead VaR density 
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when the underlying process is described by a GARCH model. Since this class 
of models is a workhorse in financial risk management, we therefore give the 
possibility to determine the VaR term structure and to characterize the uncer- 
tainty coming from the parameters. Second, we give a rational justification to 
the choice of a point estimate within the VaR density based on the decision 
theory framework. We document how agents facing different risk perspectives 
can select their optimal VaR point estimate and show that the differences across 
agents (e.g., fund and risk managers) can be substantial in terms of regulatory 
capital. Lastly, we extend our methodology to the Expected Shortfall alternative 
risk measure, and show that our simulation procedure can also be applied in a 
straightforward manner. 

The plan of this chapter is as follows. In Sect. 6.2, we formally define the 
concept of VaR and derive the s-day ahead VaR expression under GARCH 
dynamics. In Sect. 6.3, we review some fundamentals of Bayesian decision 
theory and introduce the asymmetric Linex loss function. In Sect. 6.4, we 
propose an empirical application with the estimation of the VaR term structure. 
Finally, we extend the methodology to the Expected Shortfall risk measure in 
Sect. 6.5. 


6.2 The concept of Value at Risk 


In this section, we formally define the concept of VaR and determine the density 
of the one-day ahead VaR under the GARCH(1, 1) dynamics with both Normal 
and Student-t disturbances. The density of the VaR for time horizons larger 
than one day is obtained by explicitly estimating the first four moments of the 
conditional P&L density and approximating the percentile of interest by either 
using the Cornish-Fisher expansion [see Cornish and Fisher 1937] or a Student-t 
approximation. We consider the GARCH(1, 1) model for ease of exposition but 
the methodology can be extended, upon modifications, to higher order GARCH 
models as well as asymmetric specifications. 


Definition 6.1 (Value at Risk). Let Y be a univariate random variable (not 
necessarily continuous) with a distribution function Fy. For a given risk level 
@, which belongs for risk management purposes to the interval [0.90, 0.995], the 
VaR. of Y is defined by: 


VaR? = inf {y € R| Fy(y) > ¢°} 


where 6° = (1 — ¢) for notational purposes. 
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Hence, the VaR is nothing else than a percentile of the distribution of Y. When 
the variable Y follows a standard Normal distribution, the VaR with confidence 
level ¢ is the #°th percentile denoted by zc; €.g., 20.95 = —1.64. When the 
variable Y follows a standard Student-t distribution with v degrees of freedom, 
the $°th percentile is denoted by tgc(v); e.g., to.95(5) = —2.01. We emphasize 
the notation in the Student-t case where the VaR depends on the parameter v. 


6.2.1 The one-day ahead VaR under the GARCH(1, 1) dynamics 


Under a GARCH(1, 1) model with Normal disturbances, the one-day ahead VaR. 
at risk level @, estimated at time t, is given by: 


VaR? (wy) = hilt (a, B) x z¢¢ 


where w = (a,() and hz41 is the conditional variance which is computed by 
recursion given F;, the information set at time t. Hence, under Normal distur- 
bances, the one-day ahead VaR is nothing else than a given percentile of the 
standard Normal distribution scaled by the conditional standard deviation. 

In the case of Student-t disturbances, the one-day ahead VaR at risk level 
@, estimated at time t, is given by: 


VaR? (w) = [o(v) x hess (ar, 8)]/” x tye(v) 


where in this case 7 = (a@,,v). In addition to the scale factor o(v) = <*, 
the ¢°th percentile of the standard Student-t distribution depends on the model 
parameter v. For both Normal and Student-t cases, the joint posterior sample 
can be used to simulate the density of the one-day ahead VaR at any confidence 


level ¢. 


6.2.2 The s-day ahead VaR under the GARCH(1, 1) dynamics 


If the horizon is larger than one day, predictions for the cumulative returns 
are needed, which in turn requires multi-step predictions. The cumulative re- 
turns over an s-day horizon (starting at time t) henceforth denoted as y;,;, are 
straightforwardly calculated from the single period log-returns y,4; (¢ = 1,...,8) 
as: 

Yt,s = Yet + Ytt2 +... + Ytts - 


This follows from the definition of the one-day log-return, calculated as the 
logarithmic difference of asset prices. 

As in the one-day ahead case, our aim is to express the VaR of the variable 
Yt,s aS a function of the model parameters ~. However, it is well known that 
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under GARCH dynamics, no expression in closed form exists for the density of 
Yt+i When i > 1; hence no closed form expression is available for the density of 
Yt,s either. To overcome this problem, we might use Monte Carlo simulations 
to generate the density of interest. That is, for a given set of parameters ~ and 
information set F;, we could simulate B paths for the P&L over s days. The 
VaR would then be approximated by the empirical percentile of the distribution. 
In order to obtain a density for the VaR itself, this evaluation would have to be 
handled for each w in the joint posterior sample. However, since the quantity 
of interest describes the tail of the distribution, a large amount of simulations 
B would be required to get an accurate VaR estimate, which might lead to an 
extremely costly simulation scheme. 

Therefore, in order to simplify and accelerate the estimation procedure, we 
propose an approximation of the VaR based on the first four conditional mo- 
ments of the variable y;,, which can be calculated analytically when w is known. 
To that aim, let us define the pth conditional moment of y;,, as follows: 


Kp() = “tb (uf) 


where E;y(e) = E(e | w,F;,) is the conditional expectation given ~ and F;. 
The notation «,(~) emphasizes the fact that the pth conditional moment is a 
function of w and the time index is suppressed to simplify the notation. The 
explicit calculation of the first four moments is possible using the multinomial 
formula which gives the pth power of the cumulative return as follows: 


s P 
Vins = (>: vi) 
i=1 


p! ; i 
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U1, +++ 5ts 
tit... +is=p 
Calculations given in Props. C.1 and C.3 (see App. C) show that under the 
GARCH(1,1) specification, the first and third conditional moments are zero 
which implies that the conditional density of y;,, is symmetric around zero. Cal- 
culations for the second conditional moment of y;,, in Prop. C.2 (see App. C) 
yield: 


s 


kaw) = S~ Ex y(he+:) 


i=l 


where: 
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‘eb( hess) = Oo + pr Ee, (he+i-1) (6.1) 


and p; = (a, + 2). Expression (6.1) can be evaluated recursively from 


tw (Re+1) = ht+i1(w) since this value is known given w and F;. For the fourth 
conditional moment, the calculations in Prop. C.4 (see App. C) yield the 
following expression: 


8 s-l 8 
Ka() = Ke be,p (Né4:) at oy: > be, (Yes Vi+3) (6.2) 
i=1 i=1 j=i+1 
where: 
Deb (he pi) = 06 + 71 Ee,y(hepi-1) + 72 Bey (he pi-1) (6.3) 
and: 


, Dp Ve ji , 
tw (Yeré Yirj) = ao (=) bes (he+i) + py : p2 tw (Ress) : (6.4) 


In expression (6.2), the parameter «- denotes the fourth moment of the distur- 


bances in the GARCH(1, 1) process; in the case of Normal disturbances, Kz = 3, 


while for (scaled) Student-t disturbances, Ke = 3?) The parameters 71, 72 


and p2 are functions of the set of parameters w, respectively given by: 


T1 = 2ag(a1 + £8) 
T2 = Keay + B(2a1 + B) 


and: 


p2= Keay +f. 


The conditional expectations of expressions (6.3) and (6.4) can be evaluated 


recursively from E¢.y(hi+1) = Arzi(w) and Eyy(h?,,) = h?,,(~) since these 


values are known given w and F;. 

As previously stated, the conditional moments «; (i = 1,...,4) are used to 
estimate the percentile of the conditional density of the cumulative return over 
an s-day horizon. We propose two approaches to determine this percentile. The 
first method is the well-known Cornish-Fisher expansion by [see Cornish and 
Fisher 1937] which consists in a transformation of the percentile of the standard 
Normal density to account for non-zero skewness (i.e., asymmetry) and excess 
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kurtosis (7.e., fat tails). In our context, «; and «3 are zero so that the Cornish- 
Fisher formula simplifies. We thus obtain the following approximation for the 
s-day ahead VaR (s > 1) at risk level ¢, estimated at time t: 


VaR? (ap) © #9/7() x | age + i ( — 3z¢¢) (a 7 3) (6.5) 
where we recall that ¢° = (1— ¢) and z¢e is the ¢°th percentile of the standard 
Normal distribution. From expression (6.5), we can notice the impact of the 
excess kurtosis (3 — 3) on the VaR. Since conditional moments are functions 
w, so is the VaR. 

The Cornish-Fisher expansion is widely used in practice due to its simplicity, 
and its accuracy is sufficient in many situations, especially when the distribution 
of interest is close to the Normal. In this case, the Cornish-Fisher expansion pro- 
vides a small correction for the non-zero skewness and excess kurtosis. However, 
as pointed out by Jaschke [2002], the Cornish-Fisher expansion may suffer from 
important deficiencies in pathological situations, for instance, when the kurtosis 
of the distribution we aim to approximate is high. We note in particular that: 


e The approximation may yield a distribution which is not necessar- 
ily monotone. Hence, we may be faced with situations where the 
risk capital allocated for a 1% chance event would be lower than 
the capital allocated for a 5% chance event! 


e The approximation has the wrong tail behavior, t.e., the Cornish- 
Fisher approximation for the VaR at risk level @ becomes less and 
less reliable for ¢ — {0,1}. 


These drawbacks can have serious consequences for risk management systems 
and we propose therefore a second method to approximate the percentiles of 
interest. Since the density we aim to approximate is symmetric around zero, we 
simply fit a Student-t density to the second and fourth conditional moments k2 
and 4. First, we determine the conditional kurtosis of y;,;, denoted by &, as 


aap iy _ hal) 
BP) = ab) ° 


From there, we estimate the degrees of freedom parameter 7 of the Student-t 


follows: 


density. The relation between & and 7 is given by: 


_ 6-42) 


WO) = SR) 
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Finally, the VaR is estimated by the appropriate percentile of the standard 
Student-t density scaled by the conditional standard deviation ky! *| This yields 
the following approximation for the s-day ahead VaR (s > 1) at risk level 4, 
estimated at time t: 


Bp) = 2)" 
var? (u) = (A) x toe (QW)) x7). (68) 
This approximation is a function of the set of parameters w. Hence, as with the 
Cornish-Fisher approximation, the density of the VaR can be estimated at low 
cost by simulating from the joint posterior sample. 

In the Bayesian context, we can integrate out the parameter uncertainty 
to end up with a single VaR point estimate. This problem is solved by the 
estimation of the predictive density which is defined as the density of future 
s-day ahead observations, y;., = (Yi+1°:-Yt+s)’ for s > 1, conditioned on past 
observations yo. = (y1--- ye)’ (also denoted by F;), but marginalized over 4). 
More formally, the predictive density is defined as follows: 


P(Yt:s | Yo) = [res 1d, Yoxu)P(Y | Youdy (6.7) 


where p(y;:, | Y,¥o.4) is the conditional density of y;., given (w~,yo.,) and the 
marginalization is with respect to the posterior density p(w | yo.,). In general, 
the predictive density is not available in closed form. However, one can use the 
posterior sample in conjunction with the method of composition to produce a 
sample of draws from the predictive density. We simulate a draw yl from the 
density p(y;.,,¥ | Yo.) as follows: 


; (6.8) 
yi ~ DWYVtzs | Wl You) 


where the second step in the simulation process is possible by using the method 
of composition: 


8 


Pts | Ys Yor) = [[o@u | W,Yo.(t+i-1)) : (6.9) 


i=l 


The collection of simulated values fy Pies is generated from the predictive 


density in (6.7) and the predictive VaR is a percentile of this density. For the 
one-day ahead VaR, we only need to consider the first component of vector yl 
whereas for the s-day ahead VaR, we must sum the components of yl to sim- 


ulate the predictive density for y;,;. From (6.7), we notice that the predictive 
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VaR is a quantile of a mixture density. Therefore, it can be viewed as an exten- 
sion of the case where there is no parameter uncertainty (7.e., w is constant); 
in this case, the predictive VaR would simply be estimated by a percentile of 
PY ts | Ys Yo:t)- 

To end this section, we illustrate the quality of the Cornish-Fisher and 
Student-t approximations through a simulation study. To that aim, we estimate 
the GARCH(1, 1) model with Student-t disturbances for the 750 Deutschmark 
vs British Pound foreign exchange log-returns used in the empirical analysis of 
Chap. 3. We arbitrarily select a set of parameters w = (a, @,v) in the joint 
posterior sample: 


eel |) = BLOC saa ose 
0.297 


and estimate the VaR using formulae (6.5) and (6.6) for s = 10 and ¢ ranging 
from 0.001 to 0.999 with a step size of 0.001. This procedure allows to draw two 
approximations for the distribution of y750,10. These distributions are compared 
with the distribution obtained by simulating 10’000 paths of the process over ten 
days, using (6.9). Under the model specification and the selected W, we obtain 
kg = 3.8 and &4 = 492, implying a conditional kurtosis * = 34 and a degrees of 
freedom parameter 7 = 4.2. The simulated distribution clearly exhibits heavier 
tails than the Normal distribution. 

On the left-hand side of Fig. 6.1, we display the two approximations to- 
gether with the distribution obtained by simulation. The Cornish-Fisher ap- 
proximation is shown in dotted line, the Student-t approximation in dashed line 
and the empirical distribution in solid line. From this figure, it is almost im- 
possible to distinguish the approximation based on the Student-t distribution 
from the simulated distribution. In contrast to this, the Cornish-Fisher expan- 
sion produces a S-shaped, non-monotone distribution. The four shaded regions 
delimit the extreme quantiles, at risk level ¢ € {0.01, 0.05, 0.95, 0.99}. We notice 
that the approximations for ¢ € {0.05,0.95} are quite similar for the Cornish- 
Fisher and the Student-t approaches. However, the difference is substantial in 
the case where ¢ € {0.01,0.99}. In the middle graph of Fig. 6.1, we show a 
zoom of the previous graph over the domain [2,6] x [0.94, 1]. We can see that 
the Student-t approximation fits the distribution of interest well. Hence, in this 
particular example, the graphical comparison indicates that the Cornish-Fisher 
expansion fails in approximating the distribution of interest. On the other hand, 
the approximation based on the Student-t distribution seems to provide an ad- 
equate approximation of the whole distribution. To complete the simulation 
study, we display, on the right-hand side of Fig. 6.1, the difference between 
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the simulated distribution and the Student-t approximation as a function of @°. 
The dotted lines delimit the 95% confidence band for the simulation, estimated 
by replicating 500 times the empirical distribution. We note that the difference 
lies within the [—0.1,0.1] interval for risk levels ranging from 0.05 to 0.95. For 
other percentiles, the error increases together with the width of the confidence 
band. However, the confidence interval still contains the value of zero, indicating 
a good approximation in the tails too. 

Finally, we note that other approximation methods of the whole density 
for yz,5 can be obtained [see, e.g., Highfield and Zellner 1988]. This is of inter- 
est when the density we aim to approximate is skewed, since in this case, the 
Student-t approximation would fail. Such asymmetric densities arise, e.g., with 
asymmetric GARCH models [see Engle 2004, p.415]. 
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Fig. 6.1. Cornish-Fisher and Student-t approximations. On the left-hand side, we display the distribution given by the Cornish-Fisher (in 
dotted line) and the Student-t (in dashed line) approximations together with the simulated distribution based on 10’000 paths (in solid 
line). In the middle graph, we zoom the plot over the [2,6] x [0.94,1] domain. On the right-hand side, we plot the difference between 
the simulated distribution and the Student-t approximation. The dotted lines delimit the confidence band obtained by bootstrapping the 
simulated distribution 500 times. The shaded regions indicate the Ist, 5th, 95th and 99th percentiles. 
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6.3 Decision theory 


Using the Bayesian approach leads to an interesting problem of decision theory: 
the choice of a Bayes point estimate within the whole VaR posterior density. 
In this section, we present a short review of decision theory and introduce the 
asymmetric Linex loss function. The use of asymmetric loss functions better 
characterizes the views of market participants where the impact of underesti- 
mation and overestimation can be significantly different. The Linex loss function 
has proved to be advantageous in many fields, especially for financial applica- 
tions. 

To keep the notation as general as possible, the decisions are formulated in 
terms of w which can either be viewed as a one-dimensional parameter or a point 
forecast. 


6.3.1 Bayes point estimate 


As is the case in economics, statistical decisions are made based on expected 
ranking. In economics this ranking is achieved with the help of a utility func- 
tion while we use a loss function in statistics. The Bayesian statistical decision 
consists in the choice of a point estimate over the posterior density of the pa- 
rameters. 

Let us assume that a decision maker needs to choose a point estimate @ 
and the true state of nature is w. In the Bayesian framework, the parameter 
w is random and its uncertainty is fully characterized by its posterior density 
p(w | y). Furthermore, we define the loss function “(@,w) which is the loss 
incurred when w is the true state of nature and @ is a point estimate. Then, the 
Bayes estimate, also referred to as the optimal point estimate, denoted by Gy, 
is the parameter which minimizes the posterior risk Ryv(@ | y). Formally, the 
Bayes estimate is defined as follows: 


Oy + argmin Ry(O|y) (6.10) 
where: 


Ry(@|y)= / LG,w)plw |y)dw . 


The problem of point estimation of a location parameter or forecast is most 
often treated as a symmetric problem in which positive and negative estimation 
errors of the same magnitude are considered to be equally serious; thus, the loss 
function # is symmetric. The most used loss functions are the squared error loss 
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(henceforth SEL), #(@,w) = (@ — w)?, and the absolute error loss (henceforth 
AEL), 2(@,w) = |G — w|. Indeed most of the existing VaR literature ignores 
the asymmetric loss relevant for different economic agents. However, the impact 
of overestimating or underestimating VaR can be quite different. As quoted by 
Knight, Satchell, and Wang [2003, p.335]: 


“From the perspective of the fund manager in a bank, the loss of over- 
estimating is usually much greater than that of underestimating, as the 
reserve capital exceeds the capital required by regulation and earns little 
or no return at all. On the other hand, from the regulator’s perspective, 
systematic failure would be increasing in the degree to which each bank’s 
losses actually exceed their capital reserves. So underestimating will re- 
sult in more loss for the regulator.” 


This suggests the need of an appropriate asymmetric loss function when choosing 
a point estimate within the VaR density. 

We point out that, in light of the capital structure theory, the relevance of 
an asymmetric loss function for banks is questionable since capital reserves do 
not have to be held in cash. In this case, if regulators demand a higher capital 
reserve, the bank will just have to rearrange its capital structure, which does 
not necessarily increase capital cost. This suggests that bank managers should 
in fact not be interested in minimizing regulatory capital. While this argument 
is valid for the bank as a whole, it does not hold at the trading desk level since 
it is common that traders and fund managers need a buffer in cash for facing 
market risk exposures. 


6.3.2 The Linex loss function 


The Linex loss function is employed in the analysis of several central statistical 
estimation and prediction problems. Varian [1974] motivates the use of the Linex 
loss function on the basis of an example in which there is a natural imbalance 
in the economic results of estimation errors of the same magnitude. Varian 
argues that the Linex loss is a rational manner to formulate the consequences 
of estimation errors in real estate assessment. Christoffersen and Diebold [1996, 
1997] use the Linex loss function in a study of optimal point prediction where 
different asymmetric loss functions are tested. More recently, Hwang, Knight, 
and Satchell [1999, 2001] derive the Linex one-day ahead volatility forecast for 
various volatility models. The empirical results of these authors suggest the 
Linex loss function to be particularly well-suited in financial applications. In 
addition, we note that other fields than quantitative finance make use of the 
Linex loss function. An example is given in the field of hydrology with the 
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estimation of peak water level in the construction of dams or levies. In that 
case, overestimation represents a conservative error which increases construction 
costs, while underestimation corresponds to the much more serious error in 
which overflows might lead to huge damages in the adjacent communities. 

In its reduced form, the Linex loss function is given by: 


-L£(@,w) = exp[aA] —aA—-1 (6.11) 


where a € R* and A = (@ — w) denotes the scalar estimation error in using @ 
when estimating w. From expression (6.11), we note that: 


e £ is a convex function of A; 


f is decreasing for A €] — 00, 0[ and increasing for A €]0, oof; 


e fora > 0, Y grows exponentially in positive A but behaves ap- 
proximately linearly for negative values of A. In this case, the Linex 
loss function imposes a substantial penalty for overestimation, 7.e., 


when @ > w; 


e for |a| ~ 0, & is almost symmetric and not far from a squared 
error loss function. Indeed, on expanding: 
A 2 
exp[aA] ~ 1+ aA+ oy 
and replacing it in expression (6.11), the loss function becomes 
proportional to the SEL function. Thus for small values of a, the 
SEL function is approximately nested within the Linex function. 


In the upper graph of Fig. 6.2, we display the Linex loss function for parameter 
a = 0.5 in dotted line, a = 1 in dashed line and a = 2 in solid line. We can 
notice the impact on the shape of the loss function of larger values of parameter 
a. Indeed, as a increases, the asymmetry accentuates. In the lower part of the 
figure, we show the Linex loss function for parameter a = 0.1 together with the 
appropriately scaled SEL function. The loss functions are almost the same on 
the interval. 

As verified by Zellner [1986], the derivation of the Bayes estimator of w is 
straightforward under the Linex loss function (6.11). The optimization prob- 
lem (6.10) yields: 


dy = 2 In f expl—au)p(w | y)dw) 
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provided that the integral is finite. Hence, the key to Linex estimation is to find 
the moment generating function of p(w | y), which can be estimated using the 
posterior sample. 
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Fig. 6.2. Linex loss function. In the upper graph we plot the Linex function for 
different values of parameter a. In the lower part, we show the Linex loss function for 
parameter a = 0.1 in solid line together with the SEL function in dashed line. We 
recall that A = (@ — w) where @ is the point estimate and w is the true parameter 
value. 
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6.3.3 The Monomial loss function 


We end this section by noting that other interesting asymmetric loss functions 
are readily available in the statistical literature. A general class of asymmetric 
loss functions, referred to as Monomial-splined functions by Thompson and Basu 
[1995], is defined as follows: 


a,x|AlP if ASO 


6.12 
ag X |A|P if A<O ( ) 


Y(A)= 


where A = (@—w), a; € R* (i = 1,2) and p € N*. This class provides asymmet- 
ric loss functions for aj 4 ag; when a, > a2, an overestimation incurs more loss 
than an underestimation and inversely when a, < az. From expression (6.12), 
we note the two following special cases: 


e p=1: linear-linear loss function; when a; = a2 we obtain the AEL 
function; 


e p= 2: quadratic-quadratic loss function; when a, = a2 we obtain 
the SEL function. 


It is often easier to work with a reparametrization of expression (6.12). Let us 
a 
Tarr 
the following loss function: 


define g = and make use of the homogeneity property so that we obtain 


L(A) = (at (1 — 2a)ya<o}) |AP - (6.13) 


This function is particularly interesting when p = 1. In this case, it turns out 
that the optimal point estimate wy is nothing else than the gth percentile 
of the posterior density p(w | y). Hence, expression (6.13) gives a statistical 
justification to the choice of a posterior percentile as Bayes point estimate. 
E.g., choosing the 95th percentile of the VaR density would be optimal for an 
agent whose loss function is given by (6.13) with g = 0.95. Finally, we note 
that numerical methods are needed to find Bayes point estimates for the loss 
function (6.13) when p > 1. 

In Fig. 6.3, we display the loss function given in expression (6.13) for pa- 
rameter g = 0.5 in dotted line, g = 0.75 in dashed line and g = 0.95 in solid 
line. As parameter q increases, the asymmetry becomes more pronounced and 
the impact of an overestimation, i.e., A > 0, is larger compared to the impact 
of an underestimation. The function is symmetric for g = 0.5. 
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Fig. 6.3. Monomial loss function for different values of parameter q. We recall that 
A = (© — w) where G is the point estimate and w is the true parameter value. 


6.4 Empirical application: the VaR term structure 


In this section, we estimate the term structure of the VaR when the P&L dy- 
namics is described by a GARCH(1,1) model with Normal and Student-t dis- 
turbances. Our analysis is inspired by the paper of Guidolin and Timmermann 
[2006] which considers the impact of different econometric specifications to the 
shape of the VaR term structure. While the authors report significant differ- 
ences between the models, they do not account for parameter uncertainty in 
their analysis. This is indeed a weakness of their approach, as recognized by the 
authors [see Guidolin and Timmermann 2006, p.307]: 


“We ignored parameter estimation uncertainty in our analysis, but this 
could have important effects on the results.” 


The Bayesian approach provides a natural framework for investigating this 
point. As shown in Sects. 6.2.1 and 6.2.2, the VaR can be expressed as a 
function of the GARCH(1,1) parameters under both Normal and Student-t 
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specifications. Consequently, the parameter uncertainty estimated by the joint 
posterior sample can be used to estimate the density of the VaR. in a convenient 
manner. 

An additional justification for the use of the Bayesian approach in our con- 
text is given by Miazhynskaia and Aussenegg [2006] who compare the Bayesian 
and traditional techniques for estimating GARCH models. In particular, they 
conclude that the Bayesian approach is an adequate framework with less un- 
certainty in VaR estimates compared to other VaR methods such as resampling 
technique and asymptotic Normal approximation. They also mention the inter- 
esting issue of determining a single VaR point estimate [see Miazhynskaia and 
Aussenegg 2006, p.262]: 


“Open questions for future research are how the total VaR. distribution 
can be used in market risk management and how to account for VaR 
uncertainty in choosing traditional VaR point estimates used to calculate 


capital requirements for financial institution.” 


This is precisely what we aim to achieve in a rational manner through the 
decision theory framework. 


6.4.1 Data set and estimation design 


Our empirical analysis uses the Deutschmark vs British Pound exchange rate 
daily log-returns over a sample period ranging from January 3, 1985, to Decem- 
ber 31, 1991, for a total of 1’974 observations. This data set was used in the 
empirical analysis of Chap. 3. 

We consider daily log-returns so that the VaR term structure focuses on 
short-term horizons. This is of primary interest for traders and risk managers 
who adjust the bank’s portfolios on a daily basis. The methodology can also be 
applied to longer time span log-returns as this is done in Guidolin and Timmer- 
mann [2006]. In this manner, a term structure for mid- and long-term horizons is 
obtained. Note however that modeling monthly or quarterly financial data would 
probably require more complicated models than the GARCH(1, 1) specification, 
to account for structural breaks in the time series, for instance. This could dras- 
tically complicate the approximation methodology developed in Sect. 6.2.2, in 
particular to find the first four moments conditioned on the model parameters 
and information set. 

The estimation of the GARCH(1,1) models is achieved by using the rolling 
window methodology. This procedure is heavily used in finance and financial 
risk management. The rationale behind it is to act as if we were moving over 
time, using past observations to estimate the model and test the performance 
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over a prediction window. This is based on the assumption that older data are 
not available or are irrelevant due to structural breaks, which are so complicated 
that they cannot be modeled. Conceptually, this method aims to take account 
for more recent information in a simplified framework and it has proved to be 
effective in many financial applications. 

We structure the estimation procedure as follows: 750 log-returns, which is 
about three trading years, are used to estimate the models. Then, the next 50 
log-returns, which is slightly less than one quarter, are used as a forecasting 
window. In the next step, the estimation and forecasting windows are moved 
together by 50 days ahead, so that the forecasting windows do not overlap. In 
this manner, the model parameters are updated every quarter and the estimation 
methodology fulfills the recommendations of the Basel Committee in the use of 
internal models [see Basel Committee on Banking Supervision 1996b]. When 
applied to our data set, the estimation design leads to the generation of 24 
estimation windows. The non-overlapping forecasting windows represent a total 
of 24 x 50 = 17200 observations. An illustration of the methodology is shown 
in Fig. 6.4 where we plot the first three observation windows excerpt from our 
data set; the vertical lines separate the estimation and the forecasting windows. 


Note that the standard practice when using the rolling window methodology 
in the context of GARCH models consists in moving the window by a single 
day ahead. While this procedure can be achieved quite rapidly when estimating 
the model by the Maximum Likelihood technique, this can become a computa- 
tional burden with the Bayesian approach, since at each step, we need to run 
the MCMC scheme again. This problem is however only relevant in an ex-post 
framework; a portfolio or risk manager could run the Bayesian estimation of the 
model every day, without encountering these computational difficulties. 
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Fig. 6.4. Estimation and forecasting windows (first 3 windows out of 24). The 750 
log-returns used for the estimation are shown in solid line and the 50 out-of-sample 
log-returns are shown in dotted line. The vertical line separate the estimation and the 
forecasting windows. At each step in the procedure, both windows are moved together 
by 50 days ahead. 


6.4.2 Bayesian estimation 


As prior densities for the scedastic function’s parameters @ and 3, we choose 
truncated Normal densities with zero mean vectors and diagonal covariance 
matrices whose variances are set to 10’000. In the case of Student-t disturbances, 
we use the translated Exponential as a prior density for the degrees of freedom 
parameter; the hyperparameters are set to \ = 0.01 and 6 = 4; the prior mean 
is therefore 104 and the prior variance 10’000. The parameter 6 is set so that the 
conditional variance and conditional fourth moment exist, which allows the use 
of the approximation for the predictive density based on the first four moments 
(see Sect. 6.2.2 for details). For each estimation window, two chains are run for 
10’000 passes each and the convergence diagnostic test by Gelman and Rubin 
[1992] is applied to guarantee a good convergence of the algorithm. From the 
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overall MCMC output, we discard the first 5’000 draws and merge the two chains 
to get a final sample of length 10’000. 


6.4.3 The term structure of the VaR density 


As a preliminary analysis, we consider the first observation window excerpt from 
our data set and estimate the (conditional) term structure of the VaR at risk level 
o = 0.95. We consider risk horizons ranging from one day to fifteen days for the 
VaR density estimated under both GARCH(1,1) Normal and Student-t models 
using approximation (6.6). The two term structures are depicted in Fig. 6.5; the 
lines give the median point estimates while the shaded regions depict the 95% 
confidence intervals of the densities. From this graph, we note that the VaR is a 
monotone decreasing function of the time horizon for both models. The Student-t 
specification leads to lower median point estimates for time horizons larger than 
five days while the differences for smaller horizons are less pronounced. We also 
notice that the VaR uncertainty increases for both models with respect to the 
time horizon. Furthermore, the GARCH(1, 1) model with Student-t disturbances 
leads to higher uncertainty in VaR at each horizon compared to the Normal 
specification. Finally, we note that the VaR density is almost symmetric for all 
horizons in the Normal case while the density is left-skewed for the Student-t 
model. 

This first static analysis indicates that the density of the VaR is influenced by 
the time horizon as well as the specification of the model disturbances; leptokur- 
tic disturbances lead to a larger uncertainty in the VaR as well as a left-skewed 
density. 
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Fig. 6.5. Term structures of the VaR density at risk level 6 = 0.95 for the 
GARCH(1, 1) model with Normal and Student-t disturbances. Both densities are based 
on 10’000 draws from the joint posterior sample of the models’ parameters. 


6.4.4 VaR point estimates 


We now investigate the differences in VaR point estimates under different loss 
functions of the forecasters. The comparison of point estimates over the out-of- 
sample window has two purposes. First, it will provide a statistical counterpart 
to the graphical findings observed in the preceding section. Second, since the 
VaR point estimates are used to calculate capital requirements for financial in- 
stitutions, this analysis will give a first idea on how large the differences in risk 
capital are between agents facing different risk perspectives. In what follows, we 
concentrate the analysis on horizon s € {1,5, 10} and risk level ¢ € {0.95,0.99}. 
Note that the particular case (s = 1,¢ = 0.95) corresponds to the criterion 
employed by the popular RiskMetrics benchmark [see RiskMetrics Group 1996]. 
The case (s = 10,¢ = 0.99) is recommended by the Basel Committee on Bank- 
ing Supervision [1996a] and aims to take in consideration liquidity constraints 
encountered by the bank. 
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To compare the VaR point estimates resulting from the use of different loss 
functions, we use the following methodology: for each point in the out-of-sample 
data set, we estimate the density of the VaR for the three different time hori- 
zons and the two risk levels using approximation (6.6). Then, for each den- 
sity, we determine a point estimate for the VaR which solves the optimization 
problem (6.10) for a given loss function ; this point estimate is denoted by 
VaR 15: In what follows, we use the asymmetric Linex loss, the absolute error 
loss (AEL) as well as the squared error loss (SEL) functions for #, the latter 
being considered as the benchmark in our analysis. As an additional point es- 
timate, we use the predictive VaR defined in (6.7). In this case, for each draw 
in the joint posterior sample, we generate a draw from the predictive density 
using (6.8) and the predictive VaR is obtained by calculating the appropriate 
percentile of the simulated density. 

Some comments regarding the different perspectives of VaR point estimation 
are in order here. In the first approach, we estimate the density of the VaR by 
simulating from the joint posterior sample and choose a point estimate which 
is optimal for a given loss function (Linex, SEL and AEL). The parameter 
uncertainty is integrated out in the second step of the procedure, when the 
posterior risk is minimized (see Sect. 6.3.1). This methodology is natural in 
combining estimation and decision making, and gives therefore an additional 
flexibility to the user. In the second approach, the parameter uncertainty is 
integrated out by averaging the conditional densities of the cumulated returns 
over the joint posterior density of the model parameters. In this case, the VaR 
point estimate (7.e., the predictive VaR) is not related to the risk preferences of 
an agent and is the same for both regulators and fund managers. Hence, agents 
differ not in the estimation of the VaR, but in the way they would make use of 
the point estimate afterwards. This approach is natural in Bayesian statistics 
but it is not the most sophisticated. It can be viewed as an extension of the 
case where there is no parameter uncertainty (i.e., ~ is constant); in this case, 
the predictive VaR would simply be estimated by a percentile of the conditional 
density of future observations. The justification of the predictive VaR on the 
grounds of the decision theory still needs to be established. 

To find an optimal VaR point estimate under the Linex loss function, we 
need first to choose a value for the parameter a in expression (6.11). The most 
direct way is to elicit the parameter through detailed discussion with a fund 
or risk manager. Due to the unavailability of this type of data, we will thus 
rely on the estimation of Knight et al. [2003] where the authors found a = 3 
based on Standard and Poors 500 index data. In our framework, since the VaR 
estimates are negative percentages, this positive parameter a implies a larger 
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penalty when the estimated VaR is underestimated (in absolute value) compared 
to the true VaR, i.e., VaR > VaR. Hence, the Linex optimal point estimate 
will be conservative in the sense that it will be located in the left tail of the 
density to avoid underestimation. Such loss function can thus be attributed to 
a regulator or a risk manager whose aim is to avoid systematic failure in VaR 
estimation. For comparison purposes, we also consider the Linex function with 
parameter a = —3. In that case, overestimating (in absolute value) the VaR, 
1.€., VaR < VaR, leads to a larger penalty so that the Linex point estimate will 
be located in the right tail of the VaR density. This is the loss perspective of 
a trader or fund manager whose aim is to save regulatory capital since it earns 
little or no return at all (as pointed out previously, traders hold a buffer in 
cash for facing market risk exposures). The AEL and SEL functions correspond 
to the perspective of an agent for which under- and overestimation are equally 
serious; the SEL leads however to a larger penalty for large deviations from the 
true VaR, compared to the AEL function. 

Once the time series of VaR point estimates for a given loss function has 
been obtained, we compare its values to the SEL benchmark. More precisely, 
for a given risk level ¢ and time horizon s we compute a time series of differences 
between the VaR point estimates obtained with the loss function Y and the SEL 
benchmark. Then, we estimate the average of deviations over the N = 17200 
out-of-sample observations as follows: 


1 —~¢ ae 
1S (ce — Vibes) 


t=1 


Results of the average deviations are reported in Table 6.1 where the table 
entries are given in hundredth percent, z.e., multiplied by 100, for convenience. 
The upper panel gives the results for the GARCH(1,1) model with Normal 
disturbances while the lower panel presents the results for the Student-t case. 
From this table, we first note that the average deviations vary considerably 
between table entries; the minimum value is 0.0001% in the case of the predictive 
VaR for Normal disturbances with (s = 1,¢ = 0.95) while the maximum is 
0.496% in the case of the Linex VaR (a = 3) for Student-t disturbances with 
(s = 10,6 = 0.99). Moreover, we note that the average size of deviations increase 
with respect to the forecasting horizon. The deviations are also larger for risk 
level ¢ = 0.99 and Student-t disturbances (except for the predictive VaR at risk 
level ¢ = 0.99). 

Deviations from the SEL benchmark are expected for the two Linex func- 
tions. Indeed, the asymmetric nature of the function leads to point estimates 
which are necessarily lower or larger than the mean point estimate. In the case 
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of the AEL function, departure from the SEL indicates asymmetric shapes for 
the VaR densities over the out-of-sample window. More precisely, the AEL point 
estimates are less conservative than the SEL on average, indicating left-skewed 
densities for the VaR. This asymmetry can also be captured by comparing the 
average deviations for the Linex loss functions. While a symmetric density for 
the VaR would imply similar values (of opposite sign) for the two Linex func- 
tions, this is clearly not the case here, especially for the Student-t density at 
time horizons s = 5 and s = 10. Finally, we can notice that the predictive VaR 
point estimates are close to the SEL point estimates for almost all risk levels 
and time horizons; for these cases, choosing a quantile of the predictive distri- 
bution or choosing the posterior mean of the VaR leads to the same VaR. point 
estimates, on average. 

In summary, the VaR. is left-skewed and the asymmetry as well as the un- 
certainty increase with respect to the time horizon and the risk level. The av- 
erage deviations are also larger when the GARCH(1, 1) model disturbances are 
Student-t distributed. At first sight, the deviations seem negligible (we recall 
that the maximum deviation is half a percent). As will be shown later in this 
chapter, the common testing methodology for assessing the performance of the 
VaR. is unable to discriminate between the point estimates but the deviations 
are large enough to imply substantial differences in terms of regulatory capital. 
This therefore gives an additional flexibility to the user when allocating risk 
capital. 
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Table 6.1. Average deviations of the VaR point estimates from the SEL 
benchmark.* 


GARCH(1, 1) with Normal disturbances 


bo = 0.95 b = 0.99 
Loss Y s=1 s=5 s=10 s=1 s=5 s=10 
Linex (a = 3) 0.233 1.385  —3.605 0.467 4.041 12.790 
Linex (a = —3) 0.230 1.332 3.226 0.459 3.602 9.470 
AEL? 0.066 0.256 0.604 0.093 0.637 1.628 
Predictive? 0.011 —0.148 —0.145 —0.344 —1.603 —2.309 
GARCH(1, 1) with Student-t disturbances 
bo = 0.95 ob = 0.99 
Loss Y s=l1 s=5 s=10 $= s=5 s=10 
Linex (a = 3) 0.318 2.496 11.598 1.087 9.911 49.603 
Linex (a = —3) 0.312 2.185 6.695 1.043 7.739 22.548 
AEL? 0.091 0.570 1.863 0.241 1.204 3.498 
Predictive? 0.013 —0.104 1.353 —0.027 0.645 0.847 


* The tables entries are given in hundredth percent, i.e., multiplied by 100. 2: 
loss function; s: time horizon (in days); ¢: risk level. 

* Absolute error loss. 

>In the case of the predictive VaR, the point estimate is the ¢°th percentile of 
the predictive density for the s-day ahead cumulative return; 6° = 1 — ¢. 


6.4.5 Regulatory capital 


In this section, we assess the financial consequences resulting from the use of a 
particular loss function when determining a VaR point estimate. To that aim, 
we will base our analysis on the notion of the regulatory capital as defined by the 
Basel II approach for market risk [see Basel Committee on Banking Supervision 
1996b]. This capital is a cushion for market risk exposures and its value is based 
on the ten-day ahead VaR at risk level ¢ = 0.99. Formally, the regulatory capital 
allocated at time t by an agent facing a loss function “& can be expressed as 
follows: 


—~ 0.99 ¢ 59 ~~ 0.99 (6.14) 


RC vz = min {Ya an 5p oe VaR ye t-i,10 

i=0 
where VaR 46 denotes the ten-day ahead VaR point estimate at time ¢t, for 
risk level ¢ = 0.99 and loss function Y. The value ¢ € [3,4] is a stress factor 
determined by the quality of the model; it is fixed by the regulators and is based 
on the forecasting performance of the model. We will set ¢ to 3.5 for simplicity 


in what follows. From formula (6.14), we note that the regulatory capital is 
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smoothed over time in order to avoid frequent adjustments of the balance sheet 
(which is costly for the bank) but can also react quickly enough to market news 
such as crashes. 

As this was done in Sect. 6.4.4 for the VaR point estimates, we calculate the 
time series of differences between the regulatory capital obtained under loss Y 
and the SEL benchmark and then compute the average of the deviations over the 
out-of-sample window. Results are reported in Table 6.2 where the table entries 
are given in percent. First, we can notice that the deviations are much larger 
than for the VaR, point estimates; the average deviations range from 0.023% in 
the case of the predictive VaR to 1.741% in the case of the Linex (a = 3), both 
for the GARCH(1, 1) model with Student-t disturbances. In general, deviations 
from the SEL are larger when the disturbances are Student-t distributed. The 
percentage of capital obtained with the AEL function is, on average, lower than 
with the SEL, indicating a left-skewed density for the regulatory capital. The 
largest deviation is obtained for the Linex function with parameter a = 3; in this 
case, a risk manager or regulator will keep aside a capital which is 1.741% larger 
than the SEL agent, for which under- or overestimation are equally serious. In 
contrast to this, a fund manager will be able to invest 0.79% more capital on 
financial markets. For this special case, there is a differential of about 2.5% 
in risk capital allocation between a risk manager and a fund manager; this is 
substantial if we imagine the amounts invested on financial markets by financial 


institutions. 


Table 6.2. Average deviations of the regulatory capital 
point estimates from the SEL benchmark.* 


GARCH(1, 1) disturbances 


Loss Y Normal Student-t 
Linex (a = 3) —0.446 -1.741 
Linex (a = —3) 0.331 0.790 
AEL? 0.056 0.122 
Predictive? —0.084 0.023 


* The tables entries are given in percent. : loss function. 
* Absolute error loss. 

® In the case of the predictive VaR, the point estimate is the 
¢*th percentile of the predictive density for the s-day ahead 
cumulative return; @° = 1— @. 


102 6 Value at Risk and Decision Theory 


6.4.6 Forecasting performance analysis 


To test the ability of our models to capture the true VaR, we compare the real- 
ization of the cumulated returns {y;,,}, with our VaR estimates for time horizon 
s € {1,5,10} and risk level ¢ € {0.95,0.99}. To that aim, we adopt the (back- 
testing) methodology proposed by Christoffersen [1998] which has become the 
standard practice in financial risk management. When the forecasting horizon 
is one day, this approach is based on the study of the random sequence {V,°} 
where: 


yeu jt Yir1 < VaR? 
‘ 0 else. 


A sequence of VaR. forecasts at risk level ¢ has correct conditional coverage if 
{V,°} is an independent and identically distributed sequence of Bernoulli random 
variables with parameter $° = (1—¢). In practice, this hypothesis can be verified 
by testing jointly the independence on the series and the unconditional coverage 
of the VaR forecasts, i.e., E(V,°) = ¢°. 

In order to test the performance of the s-day ahead VaR (s > 1), we use 


a similar methodology, based now on the study of the random sequence { Ue he 
where: 


1 if ys < VaR{, 


"= 0 else. 

In this case however, since the cumulative returns y,,, and y,,, overlap for 
|t — T| < s, the variables ve and V,~, are not independent and the usual test 
by Christoffersen [1998] cannot be applied directly. However, we can exploit the 
structure of dependence between x, to y;,, to get rid of this difficulty. Indeed, 
the construction of cumulative returns leads to the creation of spurious moving 
average effects of order (s—1) in the time series {y;,;},. We can therefore follow 
Diebold and Mariano [1995] and correct the test by Christoffersen [1998] for 
serial correlation via Bonferroni bounds. To that aim, we partition the series 
{¥Yt,s}, into groups for which we expect independence and correct unconditional 
coverage. Under the assumption that the series { Veh is (s—1) dependent, each 
of the following s sub-series: 
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will be iid Bernoulli distributed if the model for the underlying process is correct. 
Thus, a formal test with size bounded by a can be obtained by performing 
s tests, each of size a/s, on each of the s sub-series, and rejecting the null 
hypothesis if the null is rejected for any of the s sub-series. 

Forecasting results are reported in Table 6.3 where we give the p-values 
of the unconditional coverage (UC), independence (IND) as well as conditional 
coverage (CC) tests; for time horizons s = 5 and s = 10, we report the lowest 
p-value computed from the s series of VaR forecasts. Our results indicate that 


Table 6.3. Forecasting results of the VaR point estimates.* 


GARCH(1, 1) model with Normal disturbances 


= 0.95 ¢ = 0.99 
s UC IND CC UC IND CC 
1 0.026 0.761 0.081 0.018 0.052 0.009 
5 0.392 0.062 0.121 0.306 NA NA 
10 0.412 0.544 0.594 0.162 NA NA 
GARCH(1, 1) model with Student-t disturbances 

¢ = 0.95 o = 0.99 
s UC IND CC UC IND CC 
1 0.222 0.903 0.470 0.572 NA NA 
5 0.392 0.050 0.101 0.344 NA NA 
10 0.983 0.280 0.558 0.162 NA NA 


* Forecasting test by Christoffersen [1998] based on the SEL point 
estimates. ¢: risk level; s time horizon (in days); UC: p-value for the 
correct uncoverage test; IND: p-value for the independence test; CC: 
p-value for the correct conditional coverage test; NA: not applicable. 


the GARCH(1,1) model with Normal disturbances fails, at the 5% significance 
level, in forecasting the one-day ahead VaR for both risk levels. Indeed, the un- 
conditional coverage gives a p-value of 0.026 for 6 = 0.95 and 0.018 for ¢ = 0.99. 
The joint test of correct unconditional coverage and independence is however 
only rejected for risk level ¢ = 0.99. In contrast to this, the GARCH(1, 1) model 
with Student-t disturbances performs well. For longer time horizons, the models 
behave similarly well and neither the unconditional coverage nor the indepen- 
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dence tests are rejected at the 5% significance level. We point out, however, that 
the test by Christoffersen [1998] is powerful when the number of observations is 
large. In our context of 1’200 observations, the test of the ten-day ahead VaR 
is based on ten sequences of (only) 120 observations. At risk level ¢ = 0.99, 
a single violation is thus expected. The forecasting results should therefore be 
taken with caution in this case. 

We emphasize the fact that the test has been applied to the time series of SEL 
point estimates. For comparison purposes, we have also analyzed the forecasting 
performance of the alternative VaR point estimates, obtained with the Linex and 
AEL functions. In all cases, the testing methodology gave similar p-values for 
the different risk levels and time horizons. This is not surprising. Indeed, the 
differences between the VaR point estimates are small (we recall that the largest 
deviation is -0.496%) and the test by Christoffersen [1998] focuses on the number 
of times the VaR is exceeded instead of testing the size of discrepancy between 
predictions and realizations. In addition, the case where the differences between 
point estimates are important was observed for a forecasting horizon of ten days 
at risk level ¢ = 0.99, precisely the case where the power of the test is weak. 
Therefore, alternative (more powerful) tests should be developed, as recently 
pursued by Zumbach [2006]. In Sect. 7.6 of the next chapter, we will document 
that the differences between the one-day ahead VaR point estimates are large 
when the P&L dynamics is described by a Markov-switching GJR model. In this 
context, the loss function of the forecaster leads to different conclusions on the 
forecasting performance of the model, even when relying on the common testing 
methodology of Christoffersen [1998]. 


6.5 The Expected Shortfall risk measure 


While being now a standard tool in financial risk management, the VaR has 
been criticized in the research literature for several reasons, in particular: 


e the VaR does not tell anything about the potential size of loss that 
exceeds its level and, as a result, it is flawed; 


e the VaR is not a coherent measure of risk in the sense of Artzner, 
Delbaen, Eber, and Heath [1999]. In particular, it lacks the prop- 
erty of sub-additivity. 


To circumvent these problems, the concept of Expected Shortfall (henceforth 
ES) also known as Conditional VaR or CVaR. has been introduced by Artzner 
et al. [1999]. 
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Definition 6.2 (Expected Shortfall). Let Y be a univariate random vari- 
able with distribution Fy, assumed continuous for simplicity. Then the Expected 
Shortfall at risk level @ is defined as the expected value of Y below the VaR® 
level. Formally: 


VaR? 
dF 
ES? = E(Y | Y < VaR®) = Feo _¥dFY(y) 5 v) 


(6.15) 


where we recall that 6° = (1—@) for convenience and VaR® is given in Def. 6.1 
(see p.76). 


Basically, the ES risk measure is the expectation of the P&L below the VaR 
level. 

In the case of the GARCH(1,1) model with Normal and Student-t distribu- 
tions, the integral on the right-hand side of expression (6.15) can be calculated 
explicitly given the set of parameters w. Indeed, when the model disturbances 
are Normally distributed, the one-day ahead ES at risk level ¢, estimated at 
time ft, is given by: 


2 
—exp [- “e 


VI ge 


where we recall that w = (@, 3), hi41 is the conditional variance which is com- 


ES$(#) = hela (a, B) x (6.16) 


puted by recursion given the information set F; and zge is the ¢° percentile of 
the standard Normal distribution. In the case of a Student-t distribution with 
v degrees of freedom, the one-day ahead ES at risk level ¢, estimated at time t, 
can be expressed as follows: 


ES?(w) = [o(v) x hisi(a,,9)]'/” x Voe(v) (6.17) 


with: 


where in this case w = (a, 3,v), o(v) = v—2 and tge is the ¢°th percentile of the 
Student-t¢ distribution. Once a joint posterior sample of the model parameters 
is obtained, expressions (6.16) and (6.17) can be used to simulate the density of 
the one-day ahead ES risk measure at any risk level ¢. 
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In order to find the expression for the s-day ahead ES® (s > 1), we note that 
the Expected Shortfall can also be viewed as the average of VaR below the risk 
level @°. 


Proposition 6.3. Assuming that E(|Y|) < co and Fy is continuous, we can 
express the Expected Shortfall as follows: 


fe VaR" du 
=. 


(Y | Y < VaR?) 


Proof. The integral in (6.15) is transformed by the change of variable 
yu= Fy(y), so that: 


VaR? ge 
/ ydFy(y) = | VaR" du 
—oo 0 


since Fy(—oo) = 0, Fy(VaR®) = ¢°, du = dFy(y) and y = VaR" by Def. 6.1 
(see p.76). 


Using Prop. 6.3, we can estimate the s-day ahead ES at any risk level ¢ by 
integrating the s-day ahead VaR over the (0, ¢°] interval. Formally: 


oe u 
ES? ,(b) = —— (6.18) 
where VaR; ,(¢) is calculated using approximation (6.6) on page 81. The integral 
in expression (6.18) can be estimated by conventional quadrature methods. As 
in the one-day ahead VaR, the joint posterior sample can be used to simulate 
the density for the s-day ahead ES using formula (6.18). 

In Fig. 6.6, we display the (conditional) term structure of the ES density 
at risk level 6 = 0.95 for the first observation window excerpt from our data 
set. As in the VaR illustration of Sect. 6.4.3, the lines give the median point 
estimates and the shaded regions depict the 95% confidence intervals of the ES 
density. From this graph, we note that the GARCH(1,1) model with Student-t 
disturbances leads to lower median point estimates for every time horizon. In 
addition, the ES uncertainty increases for both models with respect to the time 
horizon. The Student-t model leads to higher uncertainty in ES at each horizon 
compared to the Normal specification. Finally, the left asymmetry of the density 
is visually apparent for horizons larger than five days for both models. 

A comparison with the VaR term structure displayed in Fig. 6.5 (see p.96) 
indicates that the ES density has heavier tails and a skewness which is more 
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Fig. 6.6. Term structures of the ES density at risk level ¢ = 0.95 for the GARCH(1, 1) 
model with Normal and Student-t disturbances. The density is based on 10’000 draws 
from the joint posterior sample of the models’ parameters. 


pronounced. Therefore, given preferences in risk perspectives lead to larger dif- 
ferences in ES point estimates. 


ig 


Bayesian Estimation of the Markov-Switching 
GJR(1, 1) Model with Student-t Innovations 


(...) “the application of GARCH to long time series of 
stock-return data will yield a high measure of persistence 
because of the presence of deterministic shifts in the 
unconditional variance and the subsequent failure of the 
econometrician to model these shifts.” 


— Christopher G. Lamoureux and William D. Lastrapes 


In this chapter, we address the problem of estimating GARCH models subject 
to structural changes in the parameters; namely, the Markov-switching GARCH 
models (henceforth MS-GARCH). In this framework, a hidden Markov sequence 
{s:} with state space {1,..., A} allows discrete changes in the model param- 
eters. Such processes have received a lot of attention in recent years as they 
provide an explanation of the high persistence in volatility (7.e., nearly unit root 
process for the conditional variance) observed with single-regime GARCH mod- 
els [see, e.g., Lamoureux and Lastrapes 1990]. Furthermore, the MS-GARCH 
models allow for a quick change in the volatility level which leads to significant 
improvements in volatility forecasts, as shown by Dueker [1997], Klaassen [2002], 
Marcucci [2005]. 

Following the seminal work of Hamilton and Susmel [1994], different para- 
metrizations have been proposed to account for discrete changes in the GARCH 
parameters [see, e.g., Dueker 1997, Gray 1996, Klaassen 2002]. However, these 
parametrizations for the conditional variance process lead to computational dif- 
ficulties. Indeed, the evaluation of the likelihood function for a sample of length 
T requires the integration over all K7 possible paths, rendering the estima- 
tion infeasible. As a remedy, approximation schemes have been proposed to 
shorten the dependence on the state variable’s history. While this difficulty is 
not present in ARCH type models, lower order GARCH specification of the con- 
ditional variance offers a more parsimonious representation than higher order 
ARCH models. 
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In order to avoid any difficulties related to the past infinite history of the 
state variable, we adopt a recent parametrization due to Haas et al. [2004]. In 
their model, the authors hypothesize K separate GARCH(1, 1) processes for the 
conditional variance of the MS-GARCH process {y,}. The conditional variances 
at time ¢ can be written in vector form as follows: 


mi) (a3) (at a) (his 
he ag at oe Aga 

=| 2 pa]. yeit ey (7.1) 
hi ag ay ge hey 


where © denotes the Hadamard product, 7.e., element-by-element multiplication. 
The MS-GARCH process {y;} is then simply obtained by setting: 


ye = ex(hgt)'/? 


where ¢; is an error term with zero mean and unit variance. The parameters ak, 
a’ and @* are therefore the GARCH(1, 1) parameters related to the kth state of 
the nature. Under this specification, the conditional variance is solely a function 
of the past data and current state s;, which avoids the problem of infinite history. 
In the context of the Bayesian estimation, this allows to simulate the state 
process in a multi-move manner which enhances the sampler’s efficiency. 

In addition to its appealing computational aspects, the MS-GARCH model 
of Haas et al. [2004] has conceptual advantages. In effect, one reason for spec- 
ifying Markov-switching models that allow for different GARCH behavior in 
each regime is to capture the difference in the variance dynamics in low- and 
high-volatility periods. As pointed out by Haas et al. [2004, p.498}: 


(...) “a relatively large value of a* and relatively low values of 3* in 
high-volatility regimes may indicate a tendency to over-react to news, 
compared to regular periods, while there is less memory in these sub- 
processes. Such an interpretation requires a parametrization of Markov- 
switching GARCH models that implies a clear association between the 
GARCH parameters within regime k, that is ak, a’ and B* and the 
corresponding {h/'} process.” 


The specification of the conditional variance in equation (7.1) allows for a clear- 
cut interpretation of the variance dynamics in each regime. Moreover, Haas 
et al. [2004] show that results on the single-regime GARCH(1, 1) model can be 
extended to their specification; in particular, they derive explicit formulae for 
the covariance stationarity condition, the unconditional variance as well as the 
dependence structure of the squared process {y?}. 
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To account for additional stylized facts observed in financial time series, es- 
pecially for stock indices (see Chap. 4), we will consider an asymmetric exten- 
sion of (7.1) in which the GARCH(1,1) processes are replaced by GJR(1, 1) 
processes. More precisely, in this Markov-switching GJR model (henceforth 
MS-GJR), the conditional variances at time ¢ can be written in vector form 


as follows: 
be Qo at as 
n2 | | 2 03 03 ; 
= : or : Try, 130} + : Tty,_1 <0} Ut-1 
nk ak al ak 
t 0 1 2 (7.2) 
a ht_y 
Be hey 
+ : © . 
he, 


where I;,} denotes the indicator function. In this setting, the conditional vari- 
ance in every regime can react asymmetrically depending on the sign of the 
past shocks due to the introduction of dummy variables. The leverage effect is 
present for a given state k as soon as ak > a. An interesting feature of the 
parametrization (7.2) lies in the fact that we can estimate whether the response 
to past negative shock on the conditional variance is different across regimes. 

The plan of this chapter is as follows. We set up the model in Sect. 7.1. 
The MCMC scheme is detailed in Sect. 7.2. The MS-GJR model as well as a 
single-regime GJR model are applied to the Swiss Market Index log-returns in 
Sect. 7.3. In Sect. 7.4, we test the models for misspecification by using the 
generalized residuals and assess the goodness-of-fit through the calculation of 
the Deviance information criterion and the model likelihoods. In Sect. 7.5, we 
test the predictive performance of the models by running a forecasting analysis 
based on the VaR. In Sect. 7.6, we propose a methodology to depict the one- 
day ahead VaR density and document how specific forecasters’ risk perspectives 
can lead to different conclusions in terms of the forecasting performance of the 
model. We conclude with some comments regarding the ML estimation of the 
MS-GJR model in Sect. 7.7. 


7.1 The model and the priors 


A Markov-switching GJR(1, 1) model with Student-t innovations may be written 
as follows: 
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ye =er(ohe)/* for t=1,...,T 
wid 


et S(0, 1,v) 
ae (7.3) 
O= 
Vv 
ht = eh; 


where e; is a K x1 vector defined by e; = (I¢s.=1} ee Ips:=K}) 5 Ty} is the indica- 
tor function; the sequence {s;} is assumed to be a stationary, irreducible Markov 
process with discrete state space {1,..., A} and transition matrix P = [P;;] 
where Pj; = P(si41 = j | st = 7); S(0,1,v) denotes the standard Student-t 
density with v degrees of freedom and g@ is a scaling factor which ensures that 
the conditional variance is given by h;. Moreover, we define the K x 1 vector of 
GJR(1,1) conditional variances in a compact form as follows: 


hy = Qo + (ailty,_,>0} ug al ty, ,<0}) ¥e-1 + BOhi-1 


where h; = (hj --- hf*)’, ay = (aj---a%*)’ for j = 0,1,2 and @ = (G'--- B*)’. 
In addition, we require that ag > 0, a; > 0, ag > O and GB = O, where 0 
is a K x 1 vector of zeros, in order to ensure the positivity of the conditional 
variance in every regime and set ho = 0 and yo = 0 for convenience. 

The use of a Student-t instead of a Normal distribution is quite popu- 
lar in standard single-regime GARCH literature. For regime-switching mod- 
els, a Student-t distribution might be seen as superfluous since the switching 
regime can account for large unconditional kurtosis in the data [see, e.g., Haas 
et al. 2004]. However, as empirically observed by Klaassen [2002], allowing for 
Student-t innovations within regimes can enhance the stability of the states and 
allows to focus on the conditional variance’s behavior instead of capturing some 
outliers. Moreover, the Student-t distribution includes the Normal distribution 
as the limiting case where the degrees of freedom parameter goes to infinity. We 
have therefore an additional flexibility in the modeling and can impose Normal- 
ity by constraining the lower boundary for the degrees of freedom parameter 
through the prior distribution. 

As pointed out in Sect. 5.1, the Student-t specification (7.3) needs to be 
re-expressed to perform a convenient Bayesian estimation. This is achieved as 
follows: 


yt = €4(moh,)/? for (al eer 
Et  N(0, 1) 
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where A’(0,1) is the standard Normal and ZG denotes the Inverted Gamma 
density. The degrees of freedom parameter vy characterizes the density of m; as 


pa \|v)= (5) : [r (Z)] ar Pe -32| ; (7.4) 


For a parsimonious expression of the likelihood function, we define the T x 1 


follows: 


vectors y = (y1---yr)', @ = (m,-+: wr)’ as well as s = (s,---57)' and regroup 
the ARCH parameters into the 3K x 1 vector a = (aj a) a)’. The model pa- 
rameters are then regrouped into the augmented set of parameters O = (wv, @,8s) 
where w = (a, 3,v, P). Finally, we define the T x T diagonal matrix: 


D = ¥(0) = diag({w,ee,h:}2,) 


where we recall that 0, e; and h; are both functions of the model parameters, 
respectively given by: 


e;(s;) = (Iys,=1} ag Tfe=K}). 


and: 


h;(a@, 3) =ao+ (ailty,_1>0} + Orly, <0})¥t-1 + B © hy_—1(@, 3) ‘ 
We can now express the likelihood function of O as follows: 
L(@ | y) x (det ©)~!/? exp [—4y’=7"y] . (7.5) 


In the Bayesian approach, the vector of hidden states is considered as a param- 
eter as implied by expression (7.5). 

The likelihood function (7.5) is invariant with respect to relabeling the states 
(i.e., the labeling of the states can be interchanged without affecting the like- 
lihood value), which leads to a lack of identification of the state-specific pa- 
rameters. So, without a prior inequality restriction on some state-specific pa- 
rameters, a multimodal posterior is obtained and is difficult to interpret and 
summarize. To overcome this problem, we make use of the permutation sampler 
of Friihwirth-Schnatter [2001b] to find suitable identification constraints. The 
permutation sampler requires priors that are labeling invariant. Furthermore, 
we cannot be completely non-informative about the state specific parameters 
since, from a theoretical viewpoint, this would result in improper posteriors [see 
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Diebolt and Robert 1994]. These points have therefore to be taken into account 
when choosing the prior densities. 
Conditionally on the kK x K transition probabilities matrix P = [P;;] where: 


Pij = P(si41 = J | 8 = 7) 


the prior on vector s is Markov: 


K K 
p(s | P) =2(s1) ]] [] Ba” 


i=1 j=1 


where Nj; = #{Si:41 = j | 3: = 2} is the number of one-step transitions from 
state i to j in the T x 1 vector s. The mass function for the initial state, 7(s,), 
is obtained by calculating the ergodic probabilities of the Markov chain. The 
vector of ergodic probabilities can be obtained as the sum of the columns of 
matrix (A’A)~!, where the matrix A is defined as follows: 


_ pl 
iD 
K 


where Ix isa K x K identity matrix and tx a K x 1 vector of ones [see Hamilton 
1994, Sect.22.2]. 

The prior density for the kK x K transition matrix P is obtained by assuming 
that the K rows are independent and that the density of the ith row is Dirichlet 
with parameter 7; = (Mi1-°-: %:K)! 


K 
p(P) = [[?@,) 


K K 
nig—1 
«TTT; ‘ 


i=1 j=1 


Due to the labeling invariance assumption, we require that i; = mp for 
i= 1,...,K and nj; = nq for i,j € {1,...,K;i A j}. A prior density with 
Np > Nq could be used to model the belief that the probability of persistence is 
bigger than the probability of transition. 

For the scedastic function’s parameters a and (3, we use truncated Normal 
densities: 


p(@) x N3K(@ | Has Ya)lyaso} 
p(B) « Nx(8 | He, Ya) peso} 
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where we recall that yw, and &, are the hyperparameters, 0 is a vector of ze- 
ros of appropriate size and Nq is the d-dimensional Normal density (d > 1). 
The assumption of labeling invariance is fulfilled if we assume further that the 
hyperparameters are the same for all states. In particular, we set: 


[Hal; 7 Hao > [Doli = O20 ’ [Hs], = Me; [Za] i, = 03 
fori =1,...,K, and: 
[Mal; = May, ? [Peles = Fay, 


fori = K+1,...,2K, and: 


Mal; =a ; [Peles = Ons 


fori = 2K +1,...,3K, where pia,, 02 


Og 


(j = 0,1,2), and yg, 0% are fixed 

hyperparameters. We note that matrices U, and Ug are diagonal in this case. 
The prior density of the T x 1 vector @ conditional on v is found by noting 

that c, are independent and identically distributed from (7.4), which yields: 


Tu 
2 


xoin=(2)* POM" (=) oe 2], 


Following Deschamps [2006], we choose a translated Exponential with pa- 
rameters \ > 0 and 6 > 2 for the degrees of freedom parameter: 


pv) = Aexp[—A(y — 6) ]lgs<v<oo} - 


Finally, we form the joint prior by assuming prior independence between a, 
B, (w,v) and (s, P) as follows: 


P(O) = p(a)p(B)p(@ | v)p(v)p(s | P)p(P) 


and by combining the likelihood function (7.5) with the joint prior above, we 
obtain the posterior density via Bayes’ rule: 


P(O | y) x LO | y)p() - (7.6) 


7.2 Simulating the joint posterior 


We draw an initial value: 
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El = (el, Bl, ao!, p19, sl, ploy 


from the joint prior and we generate iteratively J passes for O. A single pass is 
decomposed as follows: 


glJ] Ww p(s | al 1] ] ali Y ogl 1 yb 1) pli 1] y) 
Pll ~ p(P | sb) 
all w pla | BY 1 gl pU-4 sll y) 
BY! ~ p(B | alll wl YY sll, y) 
wl ~ pee | axl Bl, yi-M, gil, y) 

PI 


vy | lil) . 


pl w 


n (7.7), only @ and P can be simulated from known expressions. Draws of a 
and @ are achieved by a multivariate extension of the methodology proposed by 
Nakatsuma [1998, 2000]. The generation of state vector s is made by using the 
Forward Filtering Backward Sampling (henceforth FFBS) algorithm described 
in Chib [1996]. Finally, sampling v is achieved by an efficient rejection technique. 

As pointed out previously, the likelihood function and the joint prior are 
labeling invariant. Consequently, the joint posterior density in (7.6) will also be 
invariant and hence exhibit, at least theoretically, K! different modes. Therefore, 
it is important to carefully select constraints to identify the model. In effect, a 
constraint that ignores the geometry of the posterior density will not lead to 
a unique labeling and can introduce a bias toward the constraint, as shown in 
Frithwirth-Schnatter [2001b]. If a suitable identifying restriction is not available 
or is not known a priori, an elegant solution to determine these constraints 
is to use the random permutation sampler proposed by Frithwirth-Schnatter 

[2001b]. In this version of the permutation sampler, each pass of the MCMC 
scheme is followed by a random permutation of the regime definitions. Formally, 
a random permutation {Il,,...II,«} of {1,..., A} is selected with probability 
ae Then, for 7,7 € {1,..., A}, the element (i,j) of P is replaced by the element 
with indices (II,,II;). The hidden states process {s;} is substituted by {Ig, }. 


Finally, for k = 1,..., AK, parameter ak is replaced by an”, parameter a* by 


* and parameter B* by 6"*. Hence, relabeling only 


Cae parameter ak by as 
affects the scedastic function’s parameters, the state process and the transition 
probabilities while the vector @ and the degrees of freedom parameter v remain 
unchanged. 

The random permutation sampler by Frithwirth-Schnatter [2001b] is used to 
improve the mixing of the MCMC sampler and to explore the full unconstrained 


parameter space. Then post-processing the MCMC output of the random per- 


7.2 Simulating the joint posterior 117 


mutation sampler in an exploratory way, by plotting scatter plots for instance, 
can suggest an appropriate identification constraint, such as: 


OC fine (7.8) 


meaning, in this particular case, that the MS-GJR model can be identified 
through inequalities on the parameter (3 between regimes. At this stage, the 
model is estimated again under the constraint (7.8) by enforcing the corre- 
sponding permutation of the regimes. This version of the permutation sampler 
is referred to as the constrained permutation sampler. At each sweep of the 
sampler, we test whether the constraint is fulfilled. If not, we order the pairs 
{1, B1},..., {K, B* } with respect to the second component. The first component 
{I,,...,«} of the ordered pairs defines the correct permutation of reordering 
the state parameters and this permutation is applied to the state-specific com- 
ponents, as this was done for the random permutation sampler. If the model is 
identifiable up to permutations of the states and satisfies certain regularity con- 
ditions, the constrained posterior density will exhibit a single mode. Note that 
the selection of the constraint (7.8) is arbitrary, because there exist K! different 
ways of formulating constraints which render the model identified, namely: 


BP. 2 4 2p 


for all permutations {II,,...,Il«} of {1,..., AK}. At this stage, if label switch- 
ing still occurs, this might indicate that the inequality restriction (7.8) is not 
well suited or that the number K of chosen regimes is too large [see Friihwirth- 
Schnatter 2006, Sect.4.2]. We will now present the derivation for the full condi- 
tionals appearing in the MCMC scheme (7.7). 


7.2.1 Generating vector s 


The generation of posterior samples for the T x 1 vector s is carried out in 
a multi-move manner by using the FFBS algorithm. We refer the reader to 
Chib [1996] and Friihwirth-Schnatter [2006, Chap.11] for a detailed presenta- 
tion of this procedure. We mention however that the FFBS approach can be 
used since the conditional density of y, only depends on the current regime 
which is a consequence of the definition for the conditional variance h; = e/h;. 
Other specifications for the conditional variance in Gray [1996] or Klaassen 
[2002] for instance, do not allow for such an approach, as noted in Kaufmann 
and Friihwirth-Schnatter [2002, Sect.6.3]. The application of the FFBS algo- 
rithm has the potential advantage that the states are updated as a single block, 
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which avoids superfluous correlation in the vector’s components, and therefore 
enhances the sampler’s efficiency [see Frithwirth-Schnatter 2006, Sect.11.5.6]. 


7.2.2 Generating matrix P 


The full conditional density of the transition matrix can be derived without 
regard to the sampling model since P becomes independent of O and y given 
the vector of states. Indeed, the posterior density is obtained as follows: 


KK KK 
ed [xn li igi py x i i oa 
j=1 j=l j=1j=1 
K (7.9) 
“Tee 
j=1j=1 
K 
x [[ DG) 
i=1 


where 7); = (Nia +7ia--: Nix +nx) and Ni; = #{st41 = J | st = 1} is the total 
number of one-step transitions from state 7 to state 7 in the vector s. The rows 
of matrix P are independent a posteriori and the ith row follows a Dirichlet 


density with parameter 7;. 


7.2.3 Generating the GJR parameters 


The methodology used to draw vectors a and @ can be viewed as a multivariate 
extension of the approach proposed in Chap. 4 for the single-regime GJR model. 
Let us consider the following K x 1 vector: 


2 
= YtlK 
Tt 


Wt = h; 

where we define 7 = aw, for convenience and recall that tx% is a K x 1 vector of 
2 

ones. In order to simplify the notations further, we define v,; = - which yields 

Ww, = UzLK — hy. From there, we can transform the expression for the vector of 

conditional variances as follows: 
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h; = ao + (a Ity,_1>0} 7 Orly, <0})¥i-1 1 B © hy-1 


> (Ub — Wr) = OM + (Or fy,_, 30} + Oalty,_.<0}) Yea 


+ BO (U¢-1lK — We-1) 


2 ULK = Ag+ [7-1 (ily, ,>0} + Qallty, , <o}) + | © Ur-1bK 
— BOwr-1t+w;i 
2S Wi = UbK — Ap — [7-1 (Qil sy, ,>0} + Qalty,_,<o}) + ie] © Up_1lK 


+ Bowe 


Moreover, let us define w; = ew; and note that w; can be written as follows: 


we = EL wy 
2 
=u m= ( ve -1) 
wrohe 
a (xi ~ 1) hy 


where x7 denotes a Chi-squared variable with one degree of freedom. This comes 
from the fact that the conditional distribution of y; is Normal with zero mean 
and variance w,oh;. Therefore, the conditional mean of w; is zero and the con- 
ditional variance is 2h?. As in the single-regime GJR model, this variable can 
be approximated by z;, a Normal variable with a mean of zero and a variance of 
2he. The variable z, can be further expressed as 2, = e/.z, where z; is a function 


of vectors a and @ given by: 


Z1(Q, 3) = vtbK — A — [T-1(Qil gy, 50} + @2lty,_,<0}) + B] © vi-1eK 


7 B © Zr-1(Q, B) z 
(7.10) 


Then, we construct the T x 1 vector z = (21 --- zr)’ where x = e}z, as well as 


the T x T diagonal matrix: 
A = A(a, 8) = diag ({2e;h; (a, 8)}j—1) 
and express the approximate likelihood function of (a, 3) as follows: 
L(a, 8 | w,v,8,y) « (det A)~1/? exp [—42/A~1z] . (7.11) 


As will be shown hereafter, the construction of the proposal densities for pa- 
rameters a@ and @ is based on this likelihood function. 
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Generating vector a 


First, we note that the function z;(a@,) in (7.10) can be expressed as a lin- 
ear function of vector a. To show this, we simply extend the argument of the 
single-regime GJR model by using appropriate recursive transformations. More 
precisely, the ith component of the K x 1 vector z, can be written as follows: 


with the recursive transformations [*, vf and v;* given by: 
(B') =1+ By_4(6') 
v(B') = v¢-Nty_s>0} + B'vt_1(6') (7.12) 
v;*(B") = Ve ty.-1<0} aa Bor" (2) 


where 15 = uj = vj* = 0. We notice that If(e), uj (e) and u;*(e) in (7.12) are 
similar to the recursive transformations used for the single-regime GJR model. 
Let us now regroup the recursive values into a K x 3K matrix C; as follows: 


C; = 
(a) 0 0 wy (8") 0 0 w*(B) 0 0 
0 1; (87) 0 0 (87) O 0 vf*(8?) 0 
F 0 .. 0 A 0 . 0 0 .. 0 
0 + 0 (Bx) 0 “0 of(BX) 0 0 uf*(B*) 


It is straightforward to show that z, = ule — Cra, and since z, = e}z; we 
get 2, = v, — e,C,a. Then, by defining the T x 1 vectors z = (z1---+ zr)’ and 
v = (v,---vr)’ as well as the T x 3K matrix C whose tth row is e/C;, we 
end up with z = v — Ca which is the desired linear expression for z. The 
proposal density to sample vector @ is obtained by combining the approximate 
likelihood (7.11) and the prior density by Bayes’ update: 


da (a | Qa, 3,7,V,8,y) x N3K(@ | Li Dalliasey (7.13) 
with: 


2c. Cex 
fig = Ba(C'A vy + Dy pa) 
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where the T x T' diagonal matrix A = diag ({2e/h7 (a, 8)}7_,) and @ is the 
previous draw of @ in the M-H sampler. A candidate a* is sampled from this 
proposal density and accepted with probability: 


rin Te ee \ 
P(&, B,,v,8,P | y) qa(ax | &,B,a,v,8,y)" J 


Generating vector 3 


The function z:(a@, 3) in (7.10) could be expressed, in the previous section, as 
a linear function of a but cannot be expressed as a linear function of vector @. 
To overcome this problem, we linearize the K x 1 vector z;() by a first order 
Taylor expansion at point B: 


where B is the previous draw of 8 in the M-H sampler. Furthermore, let us 
define the following: 


dz 


r= 2(8)+GiB , = 


(7.14) 


B=B 


where the K x K matrix G; can be computed by the following recursion: 
Gp 2 hale = 44+ 6.48 


where Z,_1 is a K x K diagonal matrix with z1-1(8) in its diagonal, Ix is a 
K x K identity matrix and Go is a K x K matrix of zeros. This recursion is 
simply obtained by differentiating (7.10) with respect to 3. From the definitions 
in (7.14) we get a, ~ r; — G,@ and the approximation for z, is obtained as 
Zz ~ ry—e. GB where r; = ejr;. Let us now define the Tx 1 vector r = (r1--+ rr)’ 
as well as the Tx K matrix G whose tth row is e.G;. It turns out that z ~ r—G@, 
thus we can approximate the exponential of the approximate likelihood (7.11) 
with: 
exp [—4(r — GB)'A7*(r — GB)] . 


The proposal density to sample vector @ is obtained by combining this approx- 
imation with the prior density by Bayes’ update: 


qe(B | a, 3,,V,8,y) «NK (GB | fis, &p)lya>0} (7.15) 


with: 


122 7 MS-GJR(1,1) Model with Student-t Innovations 
G-1l is pvy-l =i 
Ug =GA'G+X, 
fig = Ne(G’A“'r + E53" pg) 


where the T x T diagonal matrix A = diag ({2e/h? (a, B) /-1). A candidate 3* 
is sampled from this proposal density and accepted with probability: 


min p(a, B*,w,V,s | y) qa(B | a, 3" ,@,V,8,y) 1 
p(a, B,w,V,s | y) qe (* | a, B,w,V,8,y) 


7.2.4 Generating vector w 


The components of @ are independent a posteriori and the full conditional 
posterior of a; is obtained as follows: 


xw, ? exp -=| (7.16) 


with: 


where we recall that hy = e}h:(a,@) and @ = “*. Expression (7.16) is the 


Vv 


kernel of an Inverted Gamma density with parameters moh and 0. 


7.2.5 Generating parameter v 


Draws from p(v | @) are made by optimized rejection sampling from a trans- 
lated Exponential source density. This is achieved by following the lines of 
Sect. 5.2.4. 

Finally, we note that the computer code and the correctness of the algorithm 
are tested as in previous chapters; the testing methodology is applicable to the 
constrained as well as unconstrained versions of the permutation sampler. 


7.3 An application to the Swiss Market Index 


We apply our Bayesian estimation method to demeaned daily log-returns {y;} of 
the Swiss Market Index (henceforth SMI). The sample period is from November 
12, 1990, to December 16, 2005 for a total of 3’800 observations and the log- 
returns are expressed in percent. The data set is freely available from the website 
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http://www.finance.yahoo.com. Note that September 11, 2001, has not been 
recorded by the data provider since the stock markets closed after the terrorist 
attacks for a few days. From this time series, the first 2’500 observations (up 
to November 2001), which represent slightly less than two third of the data set, 
are used to estimate the model while the remaining 1’300 log-returns are used 
in a forecasting performance analysis. 

The time series under investigation is plotted in the upper part of Fig. 7.1 
where the vertical line delimits the in- and out-of-sample observation windows. 
We test for autocorrelation in the times series by testing the joint nullity of au- 
toregressive coefficients for {y;}. We estimate the regression with autoregressive 
coefficients up to lag 15 and compute the covariance matrix using the White 
estimate. The p-value of the Wald test is 0.5299 which does not support the 
presence of autocorrelation. When testing for the autocorrelation in the series 
of squared observations {y?}, we strongly reject the absence of autocorrelation. 
This is in line with the autocorrelogram of {y?} plotted in the lower part of 
Fig. 7.1. The autocorrelations are large and significantly different from zero up 
to lag 70. As an additional data analysis, we test for unit root using the test by 
Phillips and Perron [1988]. The test strongly rejects the I(1) hypothesis. 

We estimate the single-regime GJR(1,1) model as well as the two-state 
Markov-switching GJR(1,1) model henceforth referred to as GJR and MS-GJR 
for convenience. Both models are estimated using the MCMC scheme presented 
in Sect. 7.2. The estimation of the GJR model is obtained as a simplified ver- 
sion of the algorithm when K = 1 by setting the T’x 1 vector s to a vector of ones 
and omitting the generation of the transition matrix. For the hyperparameters 
on priors p(@) and p(@), we set Ua,(i = 0, 1,2) and yg to zero mean vectors and 
choose diagonal covariance matrices for Nq,(i = 0,1,2) and Ug. The variances 
are set to 02, = 03 = 10’000 (i = 0,1,2) so we do not introduce tight prior 
information in our estimation. In the case of the prior on the degrees of freedom 
parameter, we choose \ = 0.01 and 6 = 2; this therefore ensures the existence 
for the conditional variance. Finally, the hyperparameters for the prior on the 
transition probabilities are set to mi; = 2 and mj; = 7j; = 1 for i,j € {1,2} so 
that we have a prior belief that the probabilities of persistence are bigger than 
the probabilities of transition. 

For both models, we run two chains for 50’000 iterations each and assess the 
convergence of the sampler by using the diagnostic test by Gelman and Rubin 
[1992]. The convergence appears rather quickly, but we nevertheless consider the 
first half of the iterations as a burn-in phase for precaution. For the GJR model, 
the acceptance rates range from 88% for vector a to 97% for 3 indicating that 
the proposal densities are close to the exact conditional posteriors. The one- 
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Fig. 7.1. SMI daily log-returns (upper graph) and sample autocorrelogram of the 
squared log-returns up to lag 100 (lower graph). The vertical line in the upper graph 
delimits the in-sample and out-of-sample observation windows. 


lag autocorrelations in the chain range from 0.52 for ay to 0.96 for @ which is 
reasonable. For the MS-GJR model, the random permutation sampler is run first 
to determine suitable identification constraints. In Fig. 7.2, we show the contour 
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plots of the posterior density for (3",a%), (G*, ai) and (3",a), respectively. 
Note that the state value k is arbitrary since all marginal densities contain 
the same information [see Friihwirth-Schnatter 2001b]. As we can notice, the 
bimodality of the posterior density is clear for the parameter 3 on the three 
graphs, suggesting a constraint of the type 6! < 8? for identification. Therefore, 
the model is estimated again by imposing this constraint at each sweep in the 
sampler and the definition of the states is permuted if the constraint is violated. 
In that case, label switching only appeared 16 times after the burn-in phase thus 
confirming the suitability of the identification constraint. The acceptance rates 
obtained with the constrained version of the permutation sampler range from 
22% for the vector a to 93% for 3. The one-lag autocorrelations range from 0.82 
for a? to 0.97 for 62. We keep every fifth draw from the MCMC output for both 
models in order to diminish the autocorrelation in the chains. The two chains 
are then merged to get a final sample of length 10’000. Finally, we note that 
a three-state Markov-switching GJR model has also been estimated. However, 
post-processing the MCMC output has not allowed to find a clear identification 
constraint. 

The posterior statistics for both models are reported in Table 7.1. In the 
case of the GJR model (upper panel), we note the high persistence for the con- 


oife2 as well as the 


ditional variance process, measured by @ + 3 where @ = 
presence of the leverage effect. The estimation of the probability P(a2z > a1 | y) 
is 0.999, supporting the asymmetric behavior of the conditional variance. The 
low value for the estimated degrees of freedom parameter indicates conditional 
leptokurtosis in the data set. In the MS-GJR case (lower panel), we note also 
the presence of the leverage effect in both states. A comparison of the scedastic 
function’s parameters between regimes indicates similar 95% confidence intervals 
for the components of the vectors a, and @2 while the difference for compo- 
nents of the ao vector is more pronounced. Indeed, for 7 = 0,1, 2, the estimated 
probabilities P(a} > a? | y) are respectively 0.774, 0.397 and 0.543. As in the 
single-regime model, the posterior density for the degrees of freedom parameter 
indicates conditional leptokurtosis. We note however that the posterior mean 
and median are larger than for the GJR model. The posterior means for proba- 
bilities pj, and p22 are respectively 0.997 and 0.995 indicating infrequent mixing 
between states. Finally, the inefficiency factors (IF) reported in the last column 
of Table 7.1 indicate that using 10’000 draws out of the MCMC sampler seems 
appropriate if we require that the Monte Carlo error in estimating the mean 
is smaller than one percent of the variation of the error due to the data. We 
recall that the IF are computed as the ratio of the squared numerical standard 
error (NSE) of the MCMC simulations and the variance estimate divided by the 
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number of iterations (i.e., the variance of the sample mean from a hypothetical 
iid sampler). The NSE are estimated by the method of Andrews [1991], using a 
Parzen kernel and AR(1) pre-whitening as presented in Andrews and Monahan 
[1992]. As noted by Deschamps [2006], this ensures easy, optimal, and automatic 
bandwidth selection. 

In Fig. 7.3, we display the marginal posterior densities for the MS-GJR 
model parameters. First, we note that the use of the constrained permutation 
sampler leads to marginal densities which are unimodal. Furthermore, we clearly 
notice that most of these densities are skewed. More precisely, the densities for 
the components of vector @ are right-skewed while components of ( are left- 
skewed. In the case of parameters aj and a7, the modes of the densities are close 
to the lower boundary of the parameter’s space, suggesting that the parameters 
are close to zero. Finally, we can notice that the posterior densities for p;, and 
p22 are strongly left-skewed. 

Some probabilistic statements on nonlinear functions of the parameters can 
be straightforwardly obtained by simulation from the joint posterior sample 
{pbl}7_. In particular, we can test the covariance stationarity condition and es- 
timate the density of the unconditional variance when this condition is satisfied. 
Under the GJR specification, the process is covariance stationary if @+ 8 < 1 
where we recall that @ = or fon for notational purposes. The estimated prob- 
ability P(@ + 8 < 1 | y) is one. Hence, the unconditional variance exists and 
is given by ices} the estimation of its posterior mean is 1.179 with a 95% 
confidence interval given by [1.173,1.189]. These estimations can be compared 
with the empirical variance of the process which is 1.136. In this case, the single- 
regime model slightly overestimates the variability of the underlying time series. 
For the Markov-switching model, our simulation study indicates that the process 
is covariance stationary in each state. The posterior mean of the unconditional 
variances is 0.56 in state 1 and 2.00 in state 2 with 95% confidence intervals 
respectively given by [0.557,0.563] and [1.992,2.012]. The unconditional vari- 
ance of the process in state 1 is about four times lower than the one in state 
2; we will therefore refer state 1 as the low-volatility regime and state 2 as the 
high-volatility regime. As found by Haas et al. [2004, Eq.11, p.500], the Markov- 
switching GARCH process is covariance stationary if €(/) < 1, where €(M) 
denotes the largest eigenvalue in modulus of matrix M. This matrix is con- 
structed from the model parameters and, in the case of the MS-GJR. model, it 
is given by: 


Table 7.1. Estimation results for the GJR model (upper panel) and 
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MS-GJR model (lower panel).* 


GJR model 

y w Yos Wo.o25 Yoo75 min max NSE IF 
ao.6--: 0.066 =~0.065 ~=—0.041 0.099 0.021 0.156 0.356 5.58 
a, 0.060 0.059 0.028 0.098 0.005 0.162 0.237 1.81 
ag 0.207 0.205 0.148 0.278 0.097 0.359 0.690 4.33 
6B 0.809 0.809 0.750 0.861 0.656 0.911 1.163 16.22 
vy 8.083 7.954 6.258 10.580 4.871 13.930 34.643 9.79 
MS-GJR model 

y w Vos Woo2 Wo.o75 min max NSE IF 
ag 0.245 0.241 0.149 0.362 0.100 0.518 2.407 19.26 
ae 0.184 0.178 0.089 0.327 0.046 0.518 1.939 10.45 
at 0.020 0.015 0.001 0.063 0.000 0.145 0.276 2.61 
a? 0.027 0.023 0.001 0.073 0.000 0.135 0.302 2:33 
as 0.229 0.224 0.123 0.361 0.074 0.534 1.278 4,21 
aa 0.220 0.215 0.136 0.332 0.090 0.462 1.140 5.21 
Bi 0.486 0.440 0.212 0.642 0.004 0.746 4.454 16.80 
3? 0.782 0.785 0.670 0.866 0.582 0.907 2.090 18.33 
vy 9.459 9.264 7.051 12.880 5.881 23.740 55.931 13.45 
pu 0.997 0.997 0.992 0.999 0.982 1.000 0.022 1.23 
pi2 0.003 0.003 0.001 0.008 0.001 0.018 0.022 123 
pa 0.005 0.004 0.001 0.011 0.001 0.023 0.027 1.13 
p22 0.995 0.996 0.989 0.999 0.978 1.000 0.027 1.13 


* wd: posterior mean; wg: estimated posterior quantile at probability ¢; 
min: minimum value; max: maximum value; NSE: numerical standard 
error (x10*); IF: inefficiency factor (i.e., ratio of the squared numerical 
standard error and the variance of the sample mean from a hypotheti- 
cal iid sampler). The posterior statistics are based on 10’000 draws from 


the constrained posterior sample. 
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(7.17) 


Using the posterior sample we can thus estimate the density 


of €(M) by substituting the values of the draws for the model parameters in the 


definition (7.17). In the upper part of Fig. 7.4, we present the posterior density 


for €(M). As we can notice, none of the values exceed one in our simulation. 


Thus, the model is covariance stationary. Therefore, the unconditional variance 


of the MS-GJR process exists and is given by: 
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hy = (vec PY’ x (In — M)~* x (4 @ a) (7.18) 


where 7 is the 2 x 1 vector of ergodic probabilities of the Markov chain, [4 
is a 4 x 4 identity matrix, vec denotes the vectorization operator which stacks 
the columns of a matrix one underneath the other and ® denotes the Kro- 
necker product. Derivation of formula (7.18) can be found in Haas et al. [2004, 
p.501]. The posterior density of the unconditional variance is shown in the lower 
part of Fig. 7.4. Its posterior mean is 1.134 with a 95% confidence interval of 
[1.128,1.139]. In this case, the confidence band for the mean contains the em- 
pirical variance of 1.136 contrary to the one in the GJR model. This suggests 
that the Markov-switching model is more apt to reproduce the variability of the 
data. 

Finally, since the states vector s = (s;--- 87)’ is considered as a parameter in 
the MCMC procedure, the draws {sl] Joy can also be stored and used to make 
inference about the smoothed probabilities. Theses probabilities are estimated 
as the percentage of replications of s; corresponding to regime k: 


J 
1 
P(ss =k] y) * > » 1,14} 
j=l 


In Fig. 7.5, we present the smoothed probabilities for the high-volatility regime 
(solid line, left axis) together with the in-sample daily log-returns (circles, right 
axis). The 95% confidence bands are shown in dashed lines but are almost indis- 
tinguishable from the point estimates. The beginning of year 1991 is associated 
with the high-volatility state. Then, from the second half of 1991 to 1997, the 
returns are clearly associated with the low-volatility regime, with the exception 
of 1994. From 1997 to 2000, the model remains in the high-volatility regime with 
a transition during the second semester 2000 to the low-volatility state. 
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Fig. 7.3. Marginal posterior densities of the MS-GJR parameters. The histograms are based on 10’000 draws from the constrained posterior 


sample. 
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Fig. 7.4. Posterior densities of the covariance stationarity condition (upper graph) 
and the unconditional variance (lower graph) of the MS-GJR process. The histograms 


are based on 10’000 draws from the constrained posterior sample. 
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Fig. 7.5. Smoothed probabilities of the high-volatility state (solid line, left axis) together with the in-sample log-returns (circles, 


axis). The 95% confidence bands are shown in dashed lines (they are almost indistinguishable from the point estimates). 


right 
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7.4 In-sample performance analysis 


7.4.1 Model diagnostics 


We check for model misspecification by analyzing the predictive probabilities re- 
ferred to as probability integral transforms or p-scores in the literature [see, e.g., 
Diebold, Gunther, and Tay 1998, Kaufmann and Friihwirth-Schnatter 2002]. We 
make use of a simpler version of this method, as proposed by Kim, Shephard, 
and Chib [1998], which consists in conditioning on point estimates of w. To 
be meaningful, the point estimate has to be chosen when the identification is 
imposed. Hence, we consider the posterior mean ~ of the constrained poste- 
rior sample. Upon defining F;_; as the information set up to time t — 1, the 
(approximate) p-scores are defined as follows: 


K 
Xt => PY Sut | St =k, , Fr_1) P(st =k | W, Fr_1) : 
k=1 


The probability P(Y; < y | s: = k,,Ft-1) can be estimated by the Student-t 
integral and the filtered probability P(s; = k | 7, 7:1) is obtained as a byprod- 
uct from the FFBS algorithm [see Chib 1996, p.83]. Under a correct specification, 
the p-scores should have independent uniform distributions asymptotically [see 
Rosenblatt 1952]. A further transformation through the Normal integral is often 
applied for convenience. In this case, we consider u;, = ®~!(z;) where ®~1(e) 
denotes the inverse cumulative standard Normal function. If the model is cor- 
rect, these generalized residuals {uz} should be independent standard Normal 
and common tests can be used to check these features. In particular, we test the 
presence of autocorrelation in the series {u;} and {u?} using a Wald test. We also 
report the results of a joint test for zero mean, unit variance, zero skewness, and 
the absence of excess kurtosis, employing the likelihood ratio framework pro- 
posed by Berkowitz [2001]. For precisions on the testing methodology, we refer 
the reader to Haas et al. [2004, p.516]. 

In the case of the GJR model, the Wald statistic for testing the joint nullity 
of autoregressive coefficients, up to lag 15, for uz; has a p-value of 0.0868 and for 
u2, a p-value of 0.399. In the case of the MS-GJR model, the p-values are 0.0745 
and 0.464, respectively. Therefore, both models seem adequate in removing the 
volatility clustering present in the data set. The likelihood ratio framework for 
testing the first four moments of the transformed residuals yields p-values of 
0.125 for the GJR model and 0.0635 for the MS-GJR model. Overall, these 
results indicate no evidence of misspecification at the 5% significance level for 
both models. 
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7.4.2 Deviance information criterion 


In order to evaluate the goodness-of-fit of the models, we use first the Deviance 
information criterion (henceforth DIC) introduced by Spiegelhalter et al. [2002]. 
The DIC is not intended for identification of the correct model, but rather 
merely as a method of comparing a collection of alternative formulations (all of 
which may be incorrect) and determining the most appropriate. This criterion 
follows from an extension of the Deviance proposed by Dempster [1997]. A recent 
article by Berg, Meyer, and Yu [2004] has illustrated the potential advantages 
of this information criterion in determining the appropriate stochastic volatility 
model. This criterion presents an interesting alternative to the Bayes factor 
which is often difficult to calculate, especially for models that involve many 
random effects, large number of unknowns or improper priors. 

Let us denote the model parameters by @ for the moment. Based on the 
posterior density of the Deviance D(@) = —21n £(0 | y) where L(@ | y) is the 
likelihood function, the DIC consists of two terms: a component that measures 
the goodness-of-fit and a penalty term for any increase in model complexity. The 
measure of fit is obtained by taking the posterior expectation of the Deviance: 


D = Eojy[D(9)] (7.19) 


where Eg\y[e] denotes the expectation with respect to the joint posterior p(@ | y). 
Provided that D(@) is available in closed form, D can easily be approximated 
using the posterior sample by estimating the sample mean of the simulated 
values of D(@). The second component measures the complexity of the model 
using the effective number of parameters, denoted by pp, and is defined as 
the difference between the posterior mean of the Deviance and the Deviance 
evaluated at a point estimate 0: 


pp = D—D(6). (7.20) 


A natural candidate for 0 is the posterior mean Eg), (@), as suggested by Spiegel- 


halter et al. [2002]. When the density is log-concave, this point estimate ensures 
a positive pp due to Jensen’s inequality. The DIC is then simply defined as 
DIC = D+ pp and, given a set of models, the one with the smallest DIC has 
the best balance between goodness-of-fit and model complexity. 

As noted in Celeux, Forbes, Robert, and Titterington [2006], the definition 
@= Zely(9) is not appropriate in mixture models when no identification is im- 


posed. Furthermore, when the state variable is discrete and considered as a 
parameter in 0, the posterior expectation usually fails to take one of the dis- 
crete values. To overcome these difficulties, we integrate out the state vector by 
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considering the observed likelihood instead [see Celeux et al. 2006, Sect.3.1] and 
make use of the constrained posterior sample in the estimation. In the context 
of MS-GARCH models, the observed likelihood, also referred to as the marginal 
likelihood in Kaufmann and Friihwirth-Schnatter [2002, p.457] is obtained as 
follows: 


K 
Libly)= [I yon |v, se =k, Fe_1) P(s = & | a) (7.21) 


t=1 Lk=1 


where p(y | wv, 5; = k,7,_-1) can be estimated by the Student-t density and 
the filtered probability P(s; = k | , 7:1) is obtained as a byproduct from the 
FFBS algorithm [see Chib 1996, p.83]. The DIC is then defined as the sum of 
components (7.19) and (7.20), which yields: 


DIC = 2f In £(G | ¥) — 2Eyly [In Le | y)]} 


where we recall that ~ = ‘wly(#) with ~ = (a, B,v, P). 
In order to make statements about the goodness-of-fit of one model relative 


to another, it is important to consider the uncertainty in the DIC. While the 
confidence interval for D can be easily obtained from the MCMC output by 
using spectral methods as this is done for the posterior mean, the task is more 
tedious in the case of pp and hence for the DIC itself. Approximation methods 
have been experimented in Zhu and Carlin [2000] but the brute force approach 
is still the most accurate. In this method, the variability of the DIC is estimated 
by running several MCMC chains and calculating the DIC’s variance from the 
different runs. Obviously, this is extremely costly. A simpler alternative consists 
in running few MCMC runs and reporting the minimum and maximum DIC 
obtained. This gives however a crude idea of DIC’s variability. In what follows, 
we make use of a methodology which estimates the whole distribution for the 
DIC based on a resampling technique. More precisely, from the joint posterior 
sample {pll}7_,, we generate randomly B new posterior samples of size J 
by using the block bootstrap technique and estimate DIC’s components for 
these samples. By comparing the 95% confidence interval of the different DICs, 
we can find statistical evidence of differences in the fitting quality. With this 
methodology, the MCMC procedure does not need to be re-run which strongly 
diminishes the computing time. The choice of the block length is an important 
issue in the block bootstrap technique. For the block bootstrap to be effective, 
the length should be large enough so that it includes most of the dependence 
structure, but not too large so that the number of blocks becomes insufficient. 
In our analysis, we use the stationary bootstrap of Politis and Romano [1994] 
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and select the block length following the algorithm based on the spectral density 
estimation, as proposed by Politis and White [2004]. We apply the block length 
selection algorithm to each parameter’s output. The maximum value is then 
defined as the optimal block length used for block bootstrapping the constrained 
posterior sample. This ad-hoc procedure allows to keep the autocorrelation in 
the chains as well as the cross-dependence structure between the parameters. 

Results for the DIC and its components are reported in Table 7.2. They are 
based on 10’000 draws from the constrained posterior distribution. In squared 
brackets we give the 95% confidence interval obtained by the resampling tech- 
nique using B = 100 replications. We keep every tenth draw from the joint 
posterior sample in the resampling technique in order to speed up the calcula- 
tions and diminish the autocorrelation in the chains. For comparison purposes, 
we also consider the Bayesian information criterion introduced by Schwarz [1978] 
which is defined as follows: 


BIC(W) = 2In L(y | y) — nInT 


where n is the number of parameters and T the number of observations. In our 
context, T = 2500, n = 5 for the GJR model and n = 11 for the MS-GJR model 
(since parameters pz and po; are redundant due to the summability constraint). 
This criterion promotes model parsimony by penalizing models with increased 
model complexity (larger n) and sample size T. Hence, a model with the largest 
BIC is preferred. The computation of the Bayesian information criterion is based 


on the posterior mean Ey)y[BIC(w)] obtained over the 10°000 draws of the 
constrained posterior sample. 


Table 7.2. Results of the DIC and BIC criteria.* 


Model DIC D pb Eyly (BIC) 

GJR 6770.4 6765.6 4.76 -6806.07 
[6769.9,6770.8] [6765.3,6765.8] [4.49,4.93] (7.12) 

MS-GJR 6713.3 6704.4 8.84 -6804.73 


[6712.6,6713.8] [6793.9,6794.9] [8.49,9.04] (12.55) 


* DIC: Deviance information criterion; D: Deviance evaluated at the 
posterior mean w (see Table 7.1, p.127); pp: effective number of pa- 
rameters; E,,) (BIC): posterior mean of BIC(7) obtained over the 10’000 
draws of the constrained posterior sample; [e]: 95% confidence interval 
based on B = 100 replications of the constrained posterior sample; (e): 
numerical standard error (x 107). 
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From Table 7.2, we can notice that both DIC and BIC criteria favor the 
MS-GJR model. Indeed, the DIC estimates based on the initial joint posterior 
sample is 6770.4 for the GJR model and 6713.3 for the MS-GJR model. Both 
95% confidence intervals do not overlap which suggests significant improvement 
of the Markov-switching model. In the case of BIC, the differences between the 
criterion’s values are less pronounced but still the Markov-switching model is 
favored compared to the single-regime model. If we consider now the estimations 
of pp, we note that the estimated value is somewhat lower than five in the GJR 
model while about nine in the MS-GJR case. Hence, in the single-regime model, 
every parameter seems to be effective (or informative) when fitting the model 
to the data set. In the Markov-switching model however, about two third of the 
13 parameters are effective. This is in line with the estimation results where 
it was shown that parameters a; and az are almost identical across regimes. 
Furthermore, the 2 x 2 transition matrix only contains two free parameters due 
to the summability constraint. This suggests that the nine effective parameters 
of the MS-GJR model are af, a%, a1, a2, 31, B?, v, pr and pro. 

Finally, we point out that we have also considered the posterior mode: 


w = arg ss Lily) 


in the definition of pp, as suggested by Celeux et al. [2006, Sect.3.1]. It is argued 
that such a point estimate is more relevant since it depends on the posterior 
distribution of the whole parameter w, rather than on the marginal posterior 
distributions of its elements. The values of pp obtained with this new definition 
are larger for both models with 95% confidence intervals respectively given by 
(5.17,5.66] and [10.06,11.12] for the single-regime and Markov-switching models. 
While the preferred model remains the MS-GJR, the interpretation of parameter 
Pp is now questionable in the GJR case since the value of pp exceeds the total 
number of parameters. 


7.4.3 Model likelihood 


As a second criterion to discriminate between the models under study, we con- 
sider the model likelihood which may be expressed as follows: 


p(y) = / Lb | y\pb)ay 


where L(y | y) is the marginal likelihood given in (7.21) and p(w) is the joint 
prior density on w = (a, ,v,P). It is clear that the model likelihood is equal 
to the normalizing constant of the posterior density: 
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LW Lys) 


Pw ly) = ay) 


The estimation of p(y) requires the integration over the whole set of parameters 
Ww, which is a difficult task in practice, especially for complex statistical models 
such as ours. A full investigation of the various approaches available to esti- 
mate the model likelihood for finite mixture models can be found in Friihwirth- 
Schnatter [2004]. In particular, the author documents that the bridge sampling 
technique using the MCMC output of the random permutation sampler and 
an iid sample from an importance density q(w) which approximates the un- 
constrained posterior yields the best estimator of the model likelihood (7.e., 
the estimator with the lowest variance). Moreover, the variance of the bridge 
sampling estimator depends on a ratio that is bounded regardless of the tail be- 
haviour of the importance density. This renders the estimator robust and gives 
more flexibility in the choice of the importance density. 

First, let us recall that the bridge sampling technique of Meng and Wong 
[1996] is based on the following result: 


_ Lal) Lyab)av _ Ey[a(v)pv Ly] (7.22) 


Jacha)p@h |y)db— Eyy [a(eh)a()] 
where a() is an arbitrary function such that [ a(w)p(wv | y)q(W)dy > 0 and E 
denotes the expectation with respect to the importance density q(w). Repladae 
p(w | y) by a in expression (7.22) yields the key identity for bridge 
sampling: 


eg law) L(b | y)v(W)] 
ly |a(wa(h)} 


We can estimate the model likelihood for a given function a(w) by replacing the 


p(y) = 


expectations on the right-hand side of the latter expression by sample averages. 
More precisely, we use MCMC draws {y!'"1}“_, from the joint posterior p(7 | y) 
and iid draws {ple from the importance sampling density g(w) to get the 
following approximation: 


we EX aH! | yi) 


=~ 7.23 
ply) aT a pea ‘i a(ylml)g(plnl) ( ) 


Meng and Wong [1996] discuss an asymptotically optimal choice for a(w), which 
minimizes the expected relative error of the p(y) estimator for iid draws from 
p(w | y) and q(w). This function is given by: 
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i 
Lq(b) + Mp | y) © 


aly) x 


This special case of bridge sampling estimator is referred to as the optimal bridge 
sampling estimator by Frithwirth-Schnatter [2001la] and will be used in what 
follows. As the optimal choice depends on the normalized posterior p(w | y), 
Meng and Wong [1996] use an iterative procedure to estimate p(y) as a limit of 
a sequence {p;(y)}. Based on an estimate p,_1(y) of the normalizing constant, 
the posterior is normalized as follows: 


- LW |y)p() 


ac eae IC 


and a new estimate p;(y) is computed using approximation (7.23). This leads 
to the following recursion: 


Pee ily) 
oe 1 La@pl)+Mpr-1 (by) 
a(plnl) 
M fe 1 Lg (pl) +Mpe-1 (lly) 


Pt (y) = pi-1(y) X 


which can be initialized, e.g., with the reciprocal importance sampling estimator 
of Gelfand and Dey [1994] given by: 


gun) 
aw) -li Dae purr) 


Note that this latter estimator is only based on MCMC draws from the joint 
posterior. Convergence of the bridge sampling technique is typically very fast in 
practice. In our case, the estimates converged after 3-4 iterations. 

The remaining task consists in choosing an appropriate importance density 
to apply the bridge sampling technique. To that aim, we follow Kaufmann and 
Friihwirth-Schnatter [2002, pp.438-439] and Kaufmann and Scheicher [2006, 
pp.9-10]. The importance density is constructed in an unsupervised manner 
from the MCMC output of the random permutation sampler using a mixture of 
the proposal and conditional densities. Its construction is fully automatic and 
is easily incorporated in the MCMC sampler [see Friihwirth-Schnatter 2001a, 
p.39]. Formally, the importance density is defined as follows: 


R 
. 1 i r Ti r Tr 
aCe) = | Yo aaler| a", ao! oP, sy) 
r=l (7.24) 


x qa(B | al, all alll sly) x p(P | sll) x qw(v) 
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where: 


, Bl, a | Wel, ost) for r=1,...,R 

are draws from the unconstrained posterior sample, ga(a@ | e) is the proposal 
density for parameter a@ given in (7.13), gg(G | e) is the proposal density for 
parameter @ given in (7.15) (the normalizing constants are easily obtained as 
the proposals are truncated multivariate Normal densities), p(P | e) is the prod- 
uct of Dirichlet posterior densities for the transition probabilities given in (7.9). 
For the degrees of freedom parameter v, the optimized rejection technique of 
Sect. 7.2.5 does not lead to a known expression for the marginal posterior 
on v. To tackle this problem, we approximate the marginal posterior by us- 
ing a truncated skewed Student-t density whose parameters are estimated by 
Maximum Likelihood from the posterior sample {vll}J_. More precisely, the 
approximation may be written as follows: 


qv) x SS(v | Te a, Wl gv>sy 


where: 
9 r (z 1) 
SS 2 
(v | u,0°,7,7) y+ 2D (§) (ar0?)'? 
Caan ; <7 
mae ee 7 te —n20} + YT -co<v-n} 
(7.25) 


is the skewed Student-t density as defined in Ferndndez and Steel [1998, Eq.13, 
p.363]. The parameters of the density defined in (7.25) are: the location pa- 


2 > 0, the degrees of freedom parameter T > 1 


rameter pu, the scale factor o 
and the asymmetry coefficient y > 0. For y = 1, the density coincides with the 
symmetric Student-t density. In cases where y > 1, the density is right-skewed 
while it is left-skewed when y < 1. Therefore, parametrization (7.25) allows for a 
wide range of asymmetric and heavy-tailed densities. Moreover, the normalizing 
constant for q,(v) is easily obtained by conventional quadrature methods. 
Some comments are in order here. First, the generation of draws from the 
proposal densities qq(a@ | ¢) and qg(G | e) is achieved by the rejection technique. 
While we obtain good acceptance rates in our case, this method can become 
very inefficient if the mass of the density is close to the domain of truncation. 
For these cases, we would need a more sophisticated algorithm, as proposed in 
Philippe and Robert [2003], Robert [1995], to draw efficiently from a truncated 


7.4 In-sample performance analysis 141 


multivariate Normal distribution. Second, the density q,(v) is constructed in two 
steps. The parameters of the the skewed Student-t are first estimated by ML 
from the MCMC output and then the density is truncated to construct q,(v). 
An alternative approach would be to fit directly the truncated skewed Student-t 
density by ML. This is however not necessary in our case since the mass of the 
posterior on the degrees of freedom is far from the truncation domain. Finally, 
generating draws from q,(v) is achieved by the rejection technique. In cases 
where the boundary is close to the high probability mass, alternative approaches, 
such as the inversion technique, are required [see, e.g., Geweke 1991]. 

As indicated previously, the parameters of the skewed Student-t density are 
estimated by ML using the posterior sample of v. In the case of the MS-GJR 
model, we obtain the following ML estimates: 


Z=949 , G?=150 , F=16.67 and 7=1.53. 


In the upper part of Fig. 7.6, we display the fitted truncated skewed Student-t 
density (in dashed line) together with the density of the posterior sample for 
y (in solid line) obtained through Gaussian kernel density estimates [see Sil- 
verman 1986]. We can notice that the truncated skewed Student-t density ap- 
proximates the marginal closely. In the lower part of the figure, we show the 
marginal posterior for parameter 3! together with the importance density com- 
puted with R = 1’000. As the construction of the mixture (7.24) is based on 
averaging over proposal densities, where the state process is sampled from the 
unconstrained posterior with balanced label switching, the mixture importance 
density is multimodal. We also notice that the importance density provides a 
good approximation of the marginal posterior. 

In Table 7.3, we report the natural logarithm of the model likelihoods 
obtained using the reciprocal sampling estimator (second column) and the bridge 
sampling estimator (last column) for M@ = L = 17000 draws. From this table, we 
can notice that both estimators are higher for the MS-GJR model, indicating 
a better in-sample fit for the regime-switching specification. As an additional 
discrimination criterion, we compute the (transformed) Bayes factor in favor of 
the MS-GJR model [see Kass and Raftery 1995, Sect.3.2]. The estimated value 
is 2 x NnBF = 2 x (—3389.66 — (—3408.04)) = 36.76, which strongly supports 
the in-sample evidence in favor of the regime-switching model. 

A final word about the robustness of these results is in order. It is indeed 
recognized that the model likelihood is sensitive to the choice of the prior density. 
We must therefore test whether an alternative joint prior specification would 
have modified the conclusion of our analysis. To answer this question, we modify 
the hyperparameters’ values and run the sampler again. This time, we consider 
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Table 7.3. Results of the model likelihood estimators.* 


Model In po(y) In p(y) 
GJR -3405.33 (2.979) -3408.04 (2.644) 
MS-GJR -3386.14 (3.109) -3389.66 (3.191) 


* In po(y): natural logarithm of the model likelihood esti- 
mate using reciprocal sampling; In p(y): natural logarithm 
of the model likelihood estimate using bridge sampling; (e) 
numerical standard error of the estimators (x 10°). 


slightly more informative priors for the vectors a and by choosing diagonal 
covariance matrices whose variances are set to 02, = 03 = 17000 (i = 0,1,2). As 
an alternative prior on the degrees of freedom parameter, we choose \ = 0.02 
and 6 = 2, which implies a prior mean of 52. Finally, the hyperparameters for 
the prior on the transition probabilities are set to 7; = 3 and jj; = nj; = 1 for 
i,j € {1,2}. We recall that the hyperparameters of the initial joint prior were 
set to on, = o% = 107000, A = 0.01, 6 = 2, ny = 2 and mj; = nj; = 1. In this 
case, the results are similar to those obtained previously. The natural logarithm 
of the bridge sampling estimator is -3402.11 for the GJR model and -3388.09 
for the MS-GJR model, implying a (transformed) Bayes factor of 28.04. These 
results are in line with the conclusion of the previous section and confirm the 
better fit of the Markov-switching model. 
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Fig. 7.6. Importance density (in dashed line) and marginal posterior density (in solid 
line) comparison. Gaussian kernel density estimates with bandwidth selected by the 
“Silverman’s rule of thumb” criterion [see Silverman 1986, p.48]. Both graphs are based 
on 10’000 draws from the unconstrained posterior sample. 
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7.5 Forecasting performance analysis 


In order to evaluate the ability of the competing models to predict the future 
behavior of the volatility process, we study the forecasted one-day ahead Value 
at Risk (henceforth VaR), which is a common tool to measure financial and 
market risks. The one-day ahead VaR. at risk level ¢ € (0,1), denoted by VaR?®, 
is estimated by calculating the ¢°th percentile of the one-day ahead predictive 
distribution, where ¢° = (1 — ¢) for convenience. The predictive density is 
obtained by simulation from the joint posterior sample {pul}e_, as follows: 


sil a p(Se44 | pl, Fi.) 


yl ng P(Yet4 | pl, sl, Fi) 


and VaR® is then simply estimated by calculating the ¢°th percentile of the 
empirical distribution fy), fe 

In order to simulate from the predictive density over the out-of-sample obser- 
vation window, the posterior sample {pl} J, should be updated using the most 
recent information. Consequently, forecasting the one-day ahead VaR would ne- 
cessitate the estimation of the joint posterior sample at each time point in the 
out-of-sample observation window. However, such an approach is computation- 
ally impractical for a large data set such as ours. Combination of MCMC and 
importance sampling to estimate efficiently this predictive density is proposed 
by Gerlach, Carter, and Kohn [1999]. Nevertheless, for the sake of simplicity, we 
will consider the same joint posterior sample, based on the in-sample data set, 
when forecasting the VaR. 

In addition to the static GJR and MS-GJR models, we consider a GJR 
model estimated on rolling windows which is the standard practice in financial 
risk management. This methodology relies on the assumption that older data are 
not available or are irrelevant due to structural breaks, which are so complicated 
that they cannot be modeled. We refer the reader to Sect. 6.4.1 for a detailed 
presentation of this procedure. For this approach, we use 750 log-returns to es- 
timate the model and the next 50 log-returns are used as a forecasting window. 
Then, the estimation and forecasting windows are moved together by 50 days 
ahead, so that the forecasting windows do not overlap. In this manner, the es- 
timation methodology fulfills the recommendations of the Basel Committee in 
the use of internal models [see Basel Committee on Banking Supervision 1996b]. 
When applied to our data set, this estimation design leads to the generation of 
26 estimation windows for a total of 26 x 50 = 1’300 out-of-sample observations. 
In the case of the static GJR and MS-GJR models, the first 2’500 observations 
of our data set are used to estimate the models while the remaining 1’300 obser- 
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vations are used to test their predictive performance. For the three models, the 
VaR predictions are obtained for the same 1’300 out-of-sample daily log-returns. 

To verify the accuracy of the VaR estimates for the analyzed models, we 
adopt the testing methodology proposed by Christoffersen [1998]. This approach 
is based on the study of the random sequence {V,°} where: 


ie, He Yi < VaR? 
: 0 else. 


A sequence of VaR. forecasts at risk level ¢ has correct conditional coverage if 
{V,°} is an independent and identically distributed sequence of Bernoulli ran- 
dom variables with parameter $°. This hypothesis can be verified by testing 
jointly the independence on the series and the unconditional coverage of the 
VaR forecasts, i.e., E(V,”) = $°, as proposed by Christoffersen [1998]. 
Forecasting results for the VaR are reported in Table 7.4 for 


@ € {0.90, 0.95, 0.99} which are typical risk levels used in financial risk manage- 
ment. The second and third columns give the expected and observed number 
of violations. The last three columns report the p-values for the tests of correct 
unconditional coverage (UC), independence (IND) and correct conditional cov- 
erage (CC). From this table, we first note that the observed number of violations 
for the MS-GJR model are closer to the expected values than for the static GJR 
model. Indeed, at the 1% significance level, the test of correct unconditional 
coverage is not rejected for the Markov-switching model while it is strongly re- 
jected for the GJR model at risk level 6 = 0.95. The test of independence is 
not rejected for both models at the 1% significance level. We can notice that 
for risk level ¢ = 0.99 this test is not applicable since no consecutive violations 
have been observed. The joint hypothesis of correct unconditional coverage and 
independent sequence is obtained via the test of correct conditional coverage. In 
the case of the MS-GJR model, p-values are close to 0.10 for risk levels ¢ = 0.9 
and ¢ = 0.95 while it is 0.030 and 0.013 in the GJR case. We therefore reject 
the correct conditional coverage hypothesis for the static GJR model at the 5% 
significance level. These results indicate the better out-of-sample performance 
of the Markoyv-switching model compared to the static GJR model. 

When comparing the MS-GJR model with the rolling GJR model, we can 
notice that both approaches perform equally well. Indeed, for both models, the 
test of independence is rejected at risk level ¢ = 0.90 while the correct condi- 
tional coverage hypothesis is not rejected at the 5% significance level. Although 
the two models are successful in forecasting the conditional variance of the SMI 
log-returns, the MS-GJR model has two advantages over the rolling window 
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Table 7.4. Forecasting results of the VaR.* 


GJR model (static approach) 
¢ EV’) # UC IND~ CC 


0.99 13 14 0.783 NA NA 
0.95 65 89 0.004 0.624 0.013 
0.90 130 143 0.236 0.018 0.030 


GJR model (rolling windows approach) 
d E(V,’) # UC IND CC 


0.99 13 15 0.586 NA NA 
0.95 65 73 0.318 0.547 0.506 
0.90 130 126 0.710 0.032 0.093 


MS-GJR model (static approach) 
od E(V,’) i UC IND CC 


0.99 13 13 1.000 NA NA 
0.95 65 80 0.065 0.323 0.112 
0.90 130 132 0.854 0.035 0.107 


* &: risk level; E(V,): expected number of violations; 
##: observed number of violations; UC: p-value for the 
correct unconditional coverage test; IND: p-value for 
the independence test; CC: p-value for the correct con- 
ditional coverage test; NA: not applicable. 


approach. First, it is able to anticipate structural breaks in the conditional vari- 
ance process. This is achieved through the estimation of the filtered probabilities 
P(s; =k |v, F;-1), as shown in Fig. 7.7. On the contrary, the rolling window 
methodology is merely an ad-hoc approach which is unable to forecast structural 
breaks. The updating frequency as well as the length of the rolling window are 
subjective quantities, albeit some ranges are recommended by regulators, so that 
different choices might lead to significant differences in the model’s performance. 
Second, the MS-GJR model needs only to be estimated once. On the contrary, 
the parameters of the GJR model must be updated frequently to account for 
structural breaks in the time series and this can have practical consequences for 
risk management systems of financial institutions. This is a definite advantage 
of the regime-switching approach compared to the traditional rolling window 
methodology. 
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7.6 One-day ahead VaR density 


As emphasized in Chap. 6, the one-day ahead VaR risk measure can be ex- 
pressed as a function of the model parameters when the underlying time series 
is described by a single-regime GARCH(1, 1) model. It turns out that this is also 
the case in the context of Markov-switching GARCH models. In effect, the one- 
day ahead VaR at risk level ¢, estimated at time t, can be explicitly calculated 
for given w~ and future state s;41 as follows: 


VaR? (wh, S41) = [o(v) x eb p1(serr) bit (a, B)]” x tye(v) (7.26) 


where we recall that o(v) = * and tye(v) denotes the ¢°th percentile of a 


Student-t distribution with v degrees of freedom. Hence, the VaR risk measure 
can be simulated from the joint posterior sample {ql 924 by first generating 
sil, from the filtered probability density p(s;., | wl, F;,), and then inputting 
the joint draw (7), sl) in expression (7.26). 

The result of this procedure is shown in Fig. 7.8 where we plot the one- 
day ahead VaR density of the MS-GJR model for two distinct time points in 
the out-of-sample observation window. We can notice that both densities are bi- 
modal, which is a consequence of the Markov-switching nature of the conditional 
variance process. At time t = 2’501, the VaR density gives a higher probability 
to larger (in absolute value) VaR values. This suggests that, at that particular 
point in time, the probability of being in the high volatility state is higher than 
being in the low-volatility regime. At time t = 3’500, the bimodality of the den- 
sity is slightly less pronounced. In this case, the VaR density puts more mass on 
smaller VaR values (in absolute value). This graph shows that the density of the 
VaR. has a particular shape in the case of the MS-GJR model. In this context, it 
would be interesting to determine if the loss function of an agent, and therefore 
the location of his optimal Bayes estimate within the VaR density, would have 
any influence on the forecasting performance of the model. 

In order to address this question, we consider different loss functions and 
determine the Bayes point estimates for the VaR by solving the optimization 
problem (6.10) of page 85. The loss functions we consider are the Linex with a 
parameter a € {—3,3}, the absolute error loss (AEL) as well as the squared error 
loss (SEL); the reader is referred to Sect. 6.4.4 for further details. We recall 
however that the Linex function with a positive parameter could be attributed 
to a regulator or risk manager whose aim is to avoid systematic failure in risk 
measure estimation. On the contrary, a negative parameter could be attributed 
to a fund manager who seeks to save risk capital since it earns little or no 
return at all (see Sect. 6.3.1 for details). The AEL and SEL correspond to the 
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perspective of an agent for whom under- and overestimation are equally serious. 
The SEL leads, however, to a larger penalty for larger deviations from the true 
value compared to the AEL function. 

The VaR risk measure obtained with the different loss functions are then 
tested over the 1’300 out-of-sample observations. To test the adequacy of the 
point estimates to reproduce the true VaR, we rely on the forecasting method- 
ology of Christoffersen [1998] as this was done in the preceding section. The 
results are reported in Table 7.5 whose second column gives the observed num- 
ber of violations and the third, fourth and fifth columns report the p-values for 
the tests of correct unconditional coverage (UC), independence (IND) and cor- 
rect conditional coverage (CC), respectively. From this table, we note first that 
the observed number of violations is close to the expected value for the Linex 
function with parameter a = 3. In this case, the test of correct unconditional 
coverage, at the 5% significance level, is never rejected. On the contrary, the 
Linex function with parameters a = —3 leads to the rejection of the null for risk 
levels 6 = 0.95 and @ = 0.99. The null hypothesis is also rejected for the AEL 
and SEL point estimates at risk level ¢ = 0.95, where the estimates systemat- 
ically underestimate (in absolute value) the true VaR. The joint hypothesis of 
correct unconditional coverage and independence is rejected at the 5% signifi- 
cance level for all functions, except the Linex with a = 3 and the SEL at risk 
level ¢ = 0.9. 

From what precedes, we can thus conclude that parameter uncertainty has to 
be taken seriously in the context of MS-GARCH models. In particular, the choice 
of a given point estimate within the VaR density has a significant impact on 
the forecasting performance of the model. A regulator (Linex a = 3) whose VaR. 
point estimate are conservative, would conclude to a good performance of the 
model while a fund manager (Linex a = —3) would systematically underestimate 
(in absolute value) the true VaR. 
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Table 7.5. Forecasting results of the VaR point esti- 
mates for the MS-GJR model.* 


¢ = 0.90, E(V,*) = 130; 


Loss 2 # UC IND CC 
Linex (a = 3) 130 1.000 0.018 0.061 
Linex (a=—3) 140 0.361 0.011 0.025 
AEL* 133. 0.782 »-—«0.011 0.039 
SEL 131 0.926 0.015 0.053 
¢ = 0.95, E(V,*) = 65; 

Loss 2 Hf UC IND CC 
Linex (a = 3) 71 0.452 0.270 0.410 
Linex (a=—3) 87 0.008 0.171 0.011 
AEL* 84 0.020 0.228 0.033 
SEL 83. 0.028 = 0.249 ~— 0.046 
= 0.99, E(V,*) = 13; 

Loss 2 # UC IND CC 
Linex (a = 3) 11 0.567 NA NA 
Linex (a=-3) 21 0.041 NA NA 
AEL? 17 «0.287 NA NA 
SEL? 14 0.783 NA NA 


* ¢: risk level; E(V,’): expected number of violations; 
#: observed number of violations; UC: p-value for the 
correct uncoverage test; IND: p-value for the indepen- 
dence test; CC: p-value for the correct conditional cov- 
erage test; NA: not applicable. 
, Absolute error loss function. 

Squared error loss function. 
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Fig. 7.8. Density of the one-day ahead VaR at risk level ¢ = 0.95 for the MS-GJR 
model at two time points in the out-of-sample observation window. Gaussian kernel 
density estimates with bandwidth selected by the “Silverman’s rule of thumb” criterion 
[see Silverman 1986, p.48]. Both graphs are based on 10’000 draws from the joint 
posterior density of the MS-GJR model parameters. 


152 7 MS-GJR(1,1) Model with Student-t Innovations 


7.7 Maximum Likelihood estimation 


We conclude this chapter with some comments regarding the Maximum Likeli- 
hood (henceforth ML) estimation of Markov-switching GARCH models. In this 
case, the estimation is handled as in Hamilton [1994, p.692], where the algorithm 
turns out to be a special case of the Expectation Maximization (henceforth EM) 
algorithm developed by Dempster, Laird, and Rubin [1977]. The classical ML 
approach cannot be applied directly, as the marginal likelihood where the latent 
process {s,} is integrated out, is not available in closed form. The estimation 
procedure is therefore decomposed into two stages. The first step consists in 
estimating the sequence of filtered probabilities {P(s; = k | /,F:-1)}#., for a 
fixed set of of parameters w. The second step maximizes the observed likelihood 
L(w | y) in expression (7.21) given this sequence of probabilities. The procedure 
is iterated until a given convergence criterion is satisfied. General results avail- 
able for the EM algorithm indicate that the likelihood function increases in the 
number of iterations. 

While apparently straightforward to handle, the ML estimation has practical 
drawbacks. Indeed, the EM algorithm guarantees a convergence to a local maxi- 
mum of the likelihood, but not necessarily to the global optimum. As reported in 
Hamilton and Susmel [1994], many starting points are required to end up with 
a global maximum. Furthermore, the covariance matrix at the optimum can be 
extremely tedious to obtain and ad-hoc procedures are often required to get re- 
liable results. E.g., Hamilton and Susmel [1994] fix some transition probabilities 
to zero in order to determine the variance estimates for some model parameters. 
Finally, testing the null of K versus K’ states is not possible within the ML 
framework since the regularity conditions for justifying the y? approximation of 
the likelihood ratio statistic do not hold. 

For comparison purposes, we estimate the MS-GJR model via the ML tech- 
nique. The iterative procedure described previously has been run using 20 ran- 
dom starting values. In all cases, the optimizer has been trapped in a local 
maximum or even did not converge. The convergence has only been achieved 
by starting the ML optimizer at the posterior mean ~ (see Table 7.1, p.127) 
obtained with the Bayesian approach. 

In Fig. 7.9, we display the marginal densities obtained via Gaussian kernel 
density estimates, for the model parameters obtained through the Bayesian ap- 
proach (in solid lines) and the ML approach (in dashed lines). From these graphs, 
we note that the ML estimation leads to more peaked density estimates and 
therefore underestimates the parameter uncertainty. Furthermore, compared to 
the Bayesian approach, the ML approach underestimates the values of the com- 
ponents of vector a whereas the components of @ are overestimated. 
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Conclusion 


Single-regime and regime-switching GARCH models are widespread and essen- 
tial tools in financial econometrics and have, until recently, mainly been es- 
timated using the Maximum Likelihood (henceforth ML) technique. However, 
the Bayesian estimation of these models has several advantages over the classical 
approach. First, computational methods based on Markov chain Monte Carlo 
(henceforth MCMC) procedures avoid the common problem of local maxima 
encountered in the ML estimation of these models. Second, the exploration of 
the joint posterior distribution gives a complete picture of the parameter un- 
certainty and this cannot be achieved via the classical approach. Third, exact 
distributions of nonlinear functions of the model parameters can be obtained at 
low cost by simulating from the joint posterior distribution. Fourth, constraints 
on the model parameters can be incorporated through appropriate prior specifi- 
cations; in such a setting, imposing the constraint of covariance stationarity for 
the regime-switching GARCH model, for instance, is straightforward. Finally, 
discrimination between models can be achieved through the calculation of model 
likelihoods and Bayes factors. All these reasons strongly motivate the use of the 
Bayesian approach when estimating GARCH models. 

The choice of the algorithm is the first issue when dealing with MCMC 
methods and it depends on the nature of the problem under study. In the case 
of GARCH models, due to the recursive nature of the conditional variance, the 
joint posterior and the full conditional densities are of unknown forms, whatever 
distributional assumptions are made on the model disturbances. Therefore, we 
cannot use the simple Gibbs sampler and need more elaborate estimation proce- 
dures. The sampling schemes adopted in this book are based on the approach of 
Nakatsuma [1998, 2000] which has the advantage of being fully automatic and 
thus avoids the time-consuming and difficult task of tuning a sampling algo- 
rithm. In addition, this approach is easy to extend to regime-switching GARCH 
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models. In this case, the parameters in each regime can be regrouped and up- 
dated by blocks which may enhance the sampler’s efficiency. 

This book presented in detail methodologies for the Bayesian estimation of 
single-regime and regime-switching GARCH models. It proposed empirical ap- 
plications to real data sets and illustrated some interesting probabilistic state- 
ments on nonlinear functions of the model parameters made possible under the 
Bayesian framework. 

The work was introduced in Chap. 1 with a review of GARCH modeling 
and a presentation of the advantages of the Bayesian approach compared to 
the traditional ML technique. In Chap. 2, we proposed a short introduction to 
the Bayesian paradigm for inference and gave an overview of the basic MCMC 
algorithms used in the rest of the book. 

In Chap. 3, we considered the Bayesian estimation of the parsimonious but 
effective GARCH(1, 1) model with Normal innovations. We detailed the MCMC 
scheme based on the methodology of Nakatsuma [1998, 2000]. An empirical 
application to a foreign exchange rate time series was presented where we com- 
pared the Bayesian and the ML point estimates. In particular, we showed that 
even for a fairly large data set, the estimates and confidence intervals are dif- 
ferent between the methods. Caution is therefore in order when applying the 
asymptotic Normal approximation for the model parameters in this case. We 
performed a sensitivity analysis to check the robustness of our results with re- 
spect to the choice of the priors and tested the residuals for misspecification. 
Finally, we compared the theoretical and sample autocorrelograms of the process 
and tested the covariance and strict stationarity conditions. 

In Chap. 4, we analyzed the linear regression model with conditionally 
heteroscedastic errors which allowed the introduction of lagged dependent vari- 
ables in the modeling; moreover, we considered the GJR(1, 1) model to account 
for asymmetric responses to past shocks in the conditional error variance pro- 
cess. We fitted the model to the Standard and Poors 100 (henceforth S&P100) 
index log-returns and compared the Bayesian and the ML estimations. We per- 
formed a prior sensitivity analysis and tested the residuals for misspecification. 
Finally, we tested the covariance stationarity condition and illustrated the dif 
ferences between the unconditional variance of the process obtained through 
the Bayesian approach and the delta method. In particular, we showed that the 
Bayesian framework leads to a more precise estimate. 

In Chap. 5, we extended the linear regression model further with the in- 
troduction of Student-t-GJR(1, 1) errors. An empirical application based on the 
S&P100 index log-returns was proposed with a comparison between the joint 
posterior and the asymptotic Normal approximation of the parameter estimates. 
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We performed a prior sensitivity analysis and tested the residuals for misspeci- 
fication. Finally, we analyzed the conditional and unconditional kurtosis of the 
underlying time series. 

In Chap. 6, we presented some financial applications of the Bayesian esti- 
mation of GARCH models. We introduced the concept of Value at Risk (hence- 
forth VaR) risk measure and proposed a methodology to estimate the density 
of this quantity for different risk levels and time horizons. We reviewed some 
basics in decision theory and used this framework as a rational justification for 
choosing a point estimate of the VaR. We showed how agents facing different 
risk perspectives could select their optimal VaR point estimate and documented 
substantial differences in terms of regulatory capital between individuals. Fi- 
nally, we extended our methodology to the Expected Shortfall (henceforth ES) 
risk measure. 

In Chap. 7, we extended the single-regime GJR model to the regime- 
switching GJR model (henceforth MS-GJR); more precisely, we considered an 
asymmetric version of the Markov-switching GARCH(1, 1) specification of Haas 
et al. [2004]. We introduced a novel MCMC scheme which can be viewed as a 
multivariate extension of the sampler proposed by Nakatsuma [1998, 2000]. As 
an application, we fitted a single-regime and a Markov-switching GJR model to 
the Swiss Market Index log-returns. We used the random permutation sampler 
of Frithwirth-Schnatter [2001b] to find suitable identification constraints for the 
MS-GJR model and showed the presence of two distinct volatility regimes in 
the time series. By using the Deviance information criterion of Spiegelhalter 
et al. [2002] and by estimating the model likelihood using the bridge sampling 
technique of Meng and Wong [1996], we showed the in-sample superiority of the 
MS-GJR model. To test the predictive performance of the models, we ran a fore- 
casting performance analysis based on the VaR. In particular, we compared the 
MS-GJR model to a single-regime GJR model estimated on rolling windows and 
concluded to the superiority of the MS-GJR specification. Finally, we proposed 
a methodology to depict the density of the one-day ahead VaR and presented a 
comparison with the traditional ML approach. 

This book proposed two main contributions which are of practical relevance 
for both market participants and academics. First, we proposed a novel MCMC 
scheme to perform the Bayesian estimation of a Markov-switching model with 
Student-t innovations and asymmetric GJR specifications for the conditional 
variance in each regime. It allows to reproduce many stylized facts observed 
in financial time series, such as volatility clustering, conditional leptokurticity 
and Markov-switching dynamics. Furthermore, it helps to identify whether the 
leverage effect is different across regimes. Our multivariate extension of the ap- 
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proach proposed by Nakatsuma [1998, 2000] leads to a fast, fully automatic and 
efficient estimation procedure compared to alternative approaches such as the 
Griddy-Gibbs sampler. Practitioners who need to run the estimation frequently 
and/or for a large number of time series should find the procedure helpful. Sec- 
ond, we provided a manner to approximate the multi-day ahead VaR and ES 
densities when the underlying process is described by a GARCH model. Our 
methodology gives the possibility to determine the term structure of these risk 
measures and to characterize the uncertainty coming from the model parame- 
ters. In our empirical application, we documented that the choice of the model 
disturbances has a significant impact on the shape of both risk measures’ den- 
sities and this effect gets larger as the time horizon increases. Moreover, the 
densities are strongly left-skewed, which implies substantial differences in risk 
capital allocation for agents facing different risk perspectives (e.g., risk and fund 
managers). 


Suggestions for further work 


This study has raised many questions and suggests interesting further avenues 
of research. 

First, in light of the results obtained in Chap. 6, additional work is required 
to assess the performance of multi-day ahead VaR models. This is essential for 
risk management purposes since the multi-day ahead VaR lies at the heart of the 
risk capital allocation’s framework. The development of powerful methodologies 
for testing the VaR is the subject of current researches and we refer the reader to 
Berkowitz, Christoffersen, and Pelletier [2006], Kaufmann [2004], Seiler [2006], 
Zumbach [2006] for details. A natural extension of our analyses would consider 
the model uncertainty in addition to the parameter uncertainty. The Bayesian 
approach provides a natural framework for tackling this issue. 

Second, regime-switching GARCH models might be compared to the class 
of stochastic volatility (henceforth SV) models [see, e.g., Jacquier, Polson, and 
Rossi 1994, Kim et al. 1998]. While SV models are highly flexible (two different 
processes drive the dynamics of the underlying time series and the dynamics 
of the volatility), they are more difficult to estimate efficiently. Determining 
whether this additional flexibility results in a superior predictive ability would 
therefore be of interest. 

Finally, in our study of the Markov-switching GJR model, we have consid- 
ered a fixed transition matrix for the state process. Consequently, the expected 
persistence of the regimes is constant over time, which is questionable. In a 
more general formulation, we could allow the transition probabilities to change 
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over time depending on some observables [see, e.g., Bauwens et al. 2006, Gray 
1996]. This would allow determining whether some exogenous variables trigger 
the switching mechanism of the volatility process. The transition probabilities 
could also depend on the past level of the volatility. In this case, we could re- 
produce an additional feature of the volatility behavior, namely the fact that 
the probability of returning to a normal (i.e., low or medium) volatility regime 
increases after a high upward jump in the volatility level [see, e.g., Bauwens 
et al. 2006, Dueker 1997]. 


A 


Recursive Transformations 


In this appendix, we demonstrate the recursive transformations introduced in 
Chaps. 3, 4 and 5 which are used to express the function z;(a) as a linear 
function of parameter a. The process for the conditional variance is based on 
observations {y,}. Results can be straightforwardly extended when a linear com- 
ponent is included in the model by considering instead the process {u;} where 


Ut = Ye — X'Y. 


A.1 The GARCH(1, 1) model with Normal innovations 


First, let us recall that, in the case of the GARCH(1, 1) process, the expression 
for the conditional variance of y; is given by: 


ht = a0 + OY; 1 + Bht-1 


where hg = yo = 0 for convenience. As shown in Sect. 3.2, the GARCH(1, 1) 
model with Normal innovations can be expressed as an ARMA(1, 1) model for 
the squared observations {y?} and approximated as follows: 


ye = a9 + (a, + B)yZ_, — Breit 2 


where {2;} is a Martingale Difference process. Let us define v1, = y? for notational 
purposes. The variable z,; can then be written as: 


Zt = Ve — AQ — (ay + B)vt-1 + Bui (A.1) 
where Up = 20 = 0. 
Proposition A.1. Upon defining the following recursive transformations: 


=1+6l_y 


* * 
Ve = U-1t+ Boy 


(A.2) 


where lj = vo = 0, expression (A.1) can be written as follows: 
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Ze =U,— (i; uf ja (A.3) 


where a = (ap a1)’. The function z,(a@) in (A.1) can therefore be expressed as 
a linear function of the 2 x 1 vector a. 


Proof. By induction: 
Beginning step: 
For t = 1, we have: 


A.2) 


v, — (UF vije@ = vy — (1 Oa = v1 — a0 a 


Z1- 


Assumption step: 
Let us assume that expression (A.3) is satisfied for t = k. 


Induction step: 
For t= k+1 we have: 


* * (A.2) * * 
Uk+1 — (Ujaa Verio =" Ups1 — (1+ BZ up + Bupa 


= VUk+ Qo Q1Uk B(aoly t ayuz) 


= UR — Ao — Que — B(K up )a 


Uk+1 — Qo — A1Uk — B(UE — Zk) 
= Ve+1— a0 — (a1 + B)UR + Bz 


A.2 The GJR(1, 1) model with Normal innovations 


First, let us recall that, in the case of the GJR(1,1) model, the expression for 
the conditional variance of y; is given by: 


hy = a0 + (arlyy,_, 50} + Callty,_,<0})Y2-1 + Bhe-1 


where ho = yo = 0 for convenience. As shown in Sect. 4.2.2, the GJR(1, 1) 
model with Normal innovations can be transformed for the squared observations 
{y?} and approximated as follows: 


Ye = A + (Orly, >0} + Gallty,_. <0} + Buti — Beri + % 


where {2;} is a Martingale Difference process. Let us define 1, = y? for notational 
purposes. Then, the variable z; can be written as: 


% = Uz — A — (QU fy,_, 0} + Calty,_, <0} + F)ve—-1 + B21 (A.4) 


where Up = 20 = 0. 
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Proposition A.2. Upon defining the following recursive transformations: 


i=14+ ply, 
Ve = Ut-ilty,_1>0} + BU¢-1 (A.5) 


up = Ve-ify,_1 <0} + Bug, 
where lj = vg = v* = 0, expression (A.4) can be written as follows: 
ze =U, — (Yi uf vf*)a (A.6) 


where @ = (Qo a1 Q2)'. The function z:(a) in (A.4) can therefore be expressed 
as a linear function of the 3 x 1 vector a. 


Proof. By induction: 


Beginning step: 
For t = 1, we have: 


0 eK 


vy — (Uf uy vy i Oe =a a SF oa; 


Assumption step: 
Let us assume that expression (A.6) is satisfied for t = k. 


Induction step: 
For t= k-+ 1, we have: 


* a ok 
Uk+1 — Can Vet. VEt1)a 


We ies os (1+ Bl, vilty,>0} + Bu, velty, <0} + Bug’ )o 

= VUpt1 — Ao — A1UEl Ey, 50} — O2UKlLy, <0} 

B(aol, + arug + a2v;,*) 
Ve-+1 — A — AV ry, 50} — A2Vel ty, <0} — B(Uy vg VE" )au 

rae) Vk+1 — Ao — AU] Ly, 50} — H2VKl fy, <0} — B(VE — 2x) 

= vpe+1 — A — (ilfy, 0} + Callty, <o} + B)UR + Bex 
(A.4) 

as a 


A.3 The GJR(1,1) model with Student-t innovations 


As shown in Sect. 5.2.2, the GJR(1,1) model with Student-t innovations, de- 


noted by {y}, can be transformed in a new sequence {v,}, where vu; = ue with 


TT] = mo and 9= a The process {v,} can then be approximated as follows: 


Ue = 0 + (nl ty, >0} + Oallty,_1<0})T#-1Ut-1 + Bur-1 — B2t-1 + 2 
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where {z;} is a Martingale Difference process. From this expression, the variable 
z, can be written as: 


Zp = U~— AQ — [(arlyy,_,>0} + Op. 40.) T= + 6] Up-1 + Bzt-1 (A.7) 
where Up = 20 = 0. 
Proposition A.3. Upon defining the following recursive transformations: 
i, = 1+ 6G, 
vu; = Yi fy,_1>0} + Buy) (A.8) 


eK + OD 2K 
U = Yl, 1 <0} a Buy 


where 1} = vi = u6* =0, expression (A.7) can be written as follows: 
ze =U, — (Yi uf vf*)a (A.9) 


where @ = (ao Q1 G2)’. The function z,(@) in (A.7) can therefore be expressed 
as a linear function of the 3 x 1 vector a. 

Proof. By induction: 

Beginning step: 


For t = 1, we have: 


A) 


vy — (If vt vf )a “= v1 — (10 0a = — ag | Zz. 


Assumption step: 
Let us assume that expression (A.9) is satisfied for t = k. 


Induction step: 
For t= k+ 1, we have: 


nti — (lip1 Vet Vet) 
(A.8) v = 1 i 21 * 27 me 
=" vurt1 — (1+ Ble Yel ty,>0y + BE Yel{y, <0} + BUR") 
= Ue+1 — 0 — AY Kl fy, 20} — C2YEL{y,<0} 
B(aoly, + avy + a2v;") 


2 2 HE eae 
= VUR+1 — % — O1Y EA Ly, 50} — C2UEH Ly, <0} — (Lj, UE VE Oe 


= VUke+1 — A — OY, {yn >0} — a2Yi. {y~<O} — B(vE = Zk) 


2: 2 
VEt1 — 9 — A1Y gl Ey, >0} — A2YKl fy, <0} — BUR + BZ 


= vpqi — a0 — [(arl ty, >0} + oly, <o}) Te + Blur + Be 


B 


Equivalent Specification 


In this appendix, we demonstrate the equivalent specification introduced at the 
end of Sect. 5.2. 


Proposition B.1. The following model: 
ye = €¢(weh,) 7? for | rere 
iid 
ii —2 
Wt ras AS) (5 . ) 


ar, 
where hy = hi(a, 3) is a GARCH scedastic function, is equivalent to: 
ye = eh!” for t=1,...,T7 
e, S*(0,1) 


where S*(0,1) denotes the standardized Student-t density, i.e., its variance is 
one. 


Proof. First, let us regroup the model parameters into w = (a, (,v) for nota- 
tional purposes. In specification (B.1), the variables ~ are independent and 
identically distributed from an Inverted Gamma density with parameters 5 and 


v—2. 
ae . 
y—2\?2 V\)~1  —¥-4 (v — 2) 
(3) PQ) eee Le 
Bee ( 2 ) n(S)) exp | | 
and the joint density of the T x 1 vector w = (m1 --- wr)’ is therefore given by: 
2\ 2 se (v — 2) 
y- V = eer eee y= 
= r(5) ; “ . (B2 
pew») =(*5*) ° [r(5)] “Ter tem |S]. ay 


Based on the T x 1 vector of observations y = (y,--- yr)’, we can express the 
likelihood function of (4, @) as follows: 
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T 2 
ht) =1/2' —1_Ut 
L(,@ | y) ei tev) t xD 2 coe 


and, by using the Bayes theorem, we obtain the following joint posterior: 


y—2 a VY =P a —1/2 
nosey (52) * br (5))" (IL#"") 
as _ (+3) 1 Y, 
x] 2 exp | : 
t=1 


Now, by using the following result: 


e ~2 Q\ “F 
| x */? exp I-=| ae = (45 )x (=) 


we can integrate (B.3) with respect to vector @ to get the following expression: 


mone 2)" (sy) 


9 Tost I eu. (rh) 


where: ‘ 
ipa a9) = WH 8) & p+ | 
V 


Some simplifications of expression (B.4) yield: 


Pate) Ie b+ el 


which if proportional to the likelihood function of parameters ~ when observa- 
tions yp ~” S*(0, hy) where hy = hi(a, 8). 


— w+) 
2 


The specification (B.1) gives an additional way of handling the Bayesian estima- 
tion of GARCH models with Student-t innovations. It has the appealing aspect 
that no additional scale factor 9 = v—2 needs to be included in the modeling. 
However, the simulation scheme presented in Deschamps [2006], Geweke [1993] 
needs to be slightly modified, as shown hereafter. 

In our application, we aim to generate efficiently draws for the degrees of 


freedom parameter v. The target density is obtained as follows: 
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piv | @) x p(w | v)p 


wee e 


(y— 


(v) 
“| aT (7 tway 
2)" (Ue) 


- exp [ —Xv- 6) Ips} 


x on |S 


(42) 
where we can express Te ee as: 


[= 


wre) 


+ 
Il 
un 


4s 
oO 
ca 
Ke) 
=F 
4g 
us 


+ 
Il 
un 


lI 
4s 
oO 
ia 
be) 
y= 7 -—— 
eh 
= 
+ 
bo 
bl 
— 
uF 


k(v) = (7 % *) 7 Ir (5) = exp[—yu] Iss} 


where: 


T 
==) > [na+a;'] +2. 


t=1 


Note that, since the function Inj + a7! is minimized at w = 1, we have that 
p> t+r>F. 

Following Deschamps [2006], the sampling density is a translated Exponen- 
tial with kernel density function given by: 


g(v; 4,0) = exp [—p(v — 6) Ips53 (B.5) 


where the parameter p is chosen to maximize the acceptance probability. Fol- 
lowing Geweke [1993], we can determine the value for this parameter. Given the 
usual regularity conditions, a necessary condition is that jz is part of a solution 
of the following system: 
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2 [Ink(v) — Ing(v; p, 6)] =0 (B.6a) 


0 
ae Ing(v; 4,0) =0 . (B.6b) 


Expliciting (B.6a) yields: 


ees eee 


where W(z) = 2@2@ denotes the Digamma function, while solving (B.6b 
dz 
yields: 
1 1 0) 
v=-+6 7 (B.8) 


Furthermore, we note that in expression (B.7), the function: 


n(*57) + (5) -¥@) 


is monotone decreasing from co to 1 on the ]2,00[ interval. Hence, since y > , 
there exists an unique pz satisfying (B.7). Now, inserting (B.8) in expression (B.7) 
yields: 


- 1+ p(d — 2) 1+ pd 1+ pd 
In + W + u—-p=O0 
2 Qu 1+ u(d — 2) Qu 
and solving for jz gives the optimal parameter ji for the efficient sampling scheme. 


The value fi can be found by standard iterative methods. Then, a candidate v* 
is sampled from (B.5) with parameter fi and accepted with probability: 


lu LL 
Tate S (1+) 
a te) T exp [1 — StH) | 


2 2 LL 


Substituting for k(v*), s(77,d) and g(v*; 7, 6) in the expression of the acceptance 
probability yields: 


x 
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C 


Conditional Moments 


In this appendix, we demonstrate the propositions for the conditional moments 
of the cumulative return y;,5 = yy Yt+i used in Sect. 6.2.2. We consider the 
case where the process {y,} is described by a GARCH(1,1) model for ease of 
exposition but the methodology can be extended, upon modifications, to higher 
order GARCH models as well as asymmetric specifications. We recall that the 
scedastic function of the GARCH(1,1) model is given by: 


ht = a0 + O1Yy;_1 + Bhe-1 


where ho = yo = O for convenience. For the model disturbances, we consider 
standardized Normal and Student-t innovations. We define E,(e) = E(e | F;) 
and suppress the dependence of the model parameters for notational pur- 
poses. The GARCH(1,1) parameters a@ = (ao a@))' and ( are regrouped into 
w = (a, 3). In the case of Student-t innovations, ) = (a, 3,v). Moreover, the 
p-th conditional moment of y;,, is denoted by Kp. 

The following properties will be used henceforth: 


A. the errors ¢; are iid (i.e., independent and identically distributed); 


B. Ex(€z4;) = 0 for i > 1 (i.e., centered distribution); 


C. E,(e?,;) =1 for i >1 (i.e., unit variance); 


D. E,(e,;) = 0 for i > 1 (i.e., symmetric distribution); 


E. the conditional variance h; is known given F;_, i.e., it is pre- 
dictable with respect to the natural filtration of the process {y¢}. 


To keep the calculations similar for both Normal and Student-t innovations, 
we assume unit variance for the disturbances €, as emphasized in property C. 
This has an implication for the fourth conditional moment of the innovations 
Ke = E,(ef,,;) for i > 1. Indeed, in the Normal case, «- = 3 while in the 
normalized Student-t case, K- = 3(v — 2)/(v — 4). In comparison, the fourth 
moment of the usual Student-t is 3v?/(v — 4)(v — 2). 
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Proposition C.1 (First conditional moment). For horizon s > 1, the value 
of the first conditional moment kK, is zero. 


Proof. 


ki = Ex(y,s) = Ex (>: ws] a x “e(Ye+i) = 0 


i=1 i=1 


since for 1 <i < 5s we have: 


: : An ‘i 
t(Yers) = eleepalesa) & be(Et44) mre = 


Proposition C.2 (Second conditional moment). For horizon s > 2, the 
value of the second conditional moment kz is: 


Ss 


Kk2 = S- t(he+i) 


i=l 


where Ue(he i) =aot pi lt (Rei 1) with p= (a1 + B). 


Proof. For horizon s > 2, the second power of the cumulative log-returns y;. is 
given by: 


2! , 
2 a ts 
YVits = 5 ayy * Met Vets 
Z . b1s...bg: 
V15 +++ ts 


fit... $ig=2 


s 
= » Vere t2 s, Yer Yetj 
i=1 


1<i,j<s 
i<j 
and the second conditional moment Kz is obtained by taking the conditional 
expectation as follows: 


Ss 


ko = SEs (y2,,) +2 5) Ex(yesi yess) - (C.1) 
i=l 1<i,j<s 
i<j 


Let us consider the first term in expression (C.1). For 1 <i < s, we have: 


i A Cus 
bt (Yera) = be(Epa ges) = be (Eg4a) be(heri) = Ee (Resa) - 


For the second term in expression (C.1), since 7 < j, we have: 


7 7 1/2, AJ, 1/2, B 
(yer Yes) = Ex(yers cerghisy) = Ex(uerahe,; Eu(erss) = 0. 


Hence, expression (C.1) simplifies to kp = )>7_, E:(he4;) where E;(h14;) can be 
expressed recursively as follows: 
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T(he+i) = Ao + pike (he+i-1) 


with E;(hi41) = he+1 since this value is known given F;. The parameter p, is a 
function of the model parameters w. Its expression is found by noting first that 
the conditional variance at time t can be written as follows: 


2 
Resi = G0 + O1Ye 44-1 + Ghtsi-1 
2 
= a9 + (1444-1 + 9) Regi-1 


= a9 + Yt4i-1heti-1 (C.2) 


where y; = (aie? + 3). By taking the conditional expectation of (C.2), we 
obtain: 


it(Ae+s) = Go Tr bt( Yeti thesi 1) 


\|> 


ao + Ex(Ye4i—1) Ex (hi4i-1) 
ao + pry (hi+i-1) 


‘ <i Cc 
where pi = bt ( Peta 1) = be(O1E74 5-1 + 2B) = QA] be(E7 44-1) + B =A,+ B. 


Proposition C.3 (Third conditional moment). For horizon s > 3, the 
value of the third conditional moment K3 is zero. 


Proof. For s > 3, the third power of the cumulative log-returns y;,, is given by: 


3! 
3 ot sae ts 
Yt,s = » hl ..g) < Yeti Vets 


7 : Jie bge 
U1, +++ bs 
it... tts=3 
s 
= Py t3 : ; +6 Vers 
= Yeti Yes t+ Yt+i Yt+j Yt+k 
j=1 1<i,j<s 1<i,j,.kXs 


iFj i<j<k 
and the third conditional moment «3 is obtained by taking the conditional ex- 
pectation as follows: 


Ss 


K3 = a Us (Yepi) + 35 (Vera Yery) +6 <a le(Yr+i Yet 7 Yerk)- — (C.3) 
i=1 1<i,j<s 1<i,j,k<s 
iFj i<j<k 


Let us consider the first term in expression (C.3). For 1 <i < s, we have: 


3/2, A 3/2, D 
ve (e44) = Ee(e3, she) = Ee (62, JEe(he4) 20. 


For the second term, we need to distinguish two cases. First, when 7 < j, we 
obtain: 


1/2, A 1/2 B 
bs (Ye si Ut Lg) = ete ety jhe) = celyepahes a) be (Et+) =0. 
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In the case where j < i, we have: 
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The last equality follows from: 


Mt+j [eer Ve44] 


B,D 


Lt-+j—1 ler j(Oet ys; + 6)| 


on Eesj—s[epss] + @Ee+5-1lee+,] 


= 0. 
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Finally, let us consider the last term in expression (C.3). Since i < j < k, we 


have: 
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Each term in expression (C.3) vanishes so that the third conditional moment is 


Zero. 


Proposition C.4 (Fourth conditional moment). 


value of the fourth conditional moment ka is: 
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where: 
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with py = (a1 + 8), po = (Kea1 + 8), T1 = 2agp1 and T2 = K-02 + B(2a1 + B). 


Proof. For s > 4, the fourth power of the cumulative log-returns y;,, is given 
by: 
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and the fourth conditional moment «4 is obtained by taking the conditional 
expectation as follows: 
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Let us first consider the terms in (C.6) which vanish. For the second term in 
expression (C.6), when i < j, we have: 
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Let us now consider the fourth term in expression (C.6). Assuming that i <j <k 


yields: 
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The same holds fori< k <j, j <i<kandk <i < j. It remains to consider 
the cases where 7 is the greatest integer, 7.e., 
assuming (without loss of generality) that 7 < k < i we have: 
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Finally, let us consider the last term in expression (C.6). By assuming (without 
loss of generality) that i<j <k <1, we have: 


Ut (Yer Yerg Yerk Yeti 


i= 


1/2 
be(Yeré Yes York M44) 


be (E441) = 0. 


With these intermediate results, expression (C.6) can be simplified as follows: 
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Let us explicit the first term in expression (C.7). We have: 
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where 7, = 2aop1, p1 = Ex(yi+i-1) and 72 = Ey(p?,,_1). Expression (C.8) can 
be computed recursively since hy: is known given F;. Furthermore, we can 
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It remains to consider the last term in expression (C.6). Since i < j, we have: 
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where the conditional expectation 
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Replacing (C.10) in expression (C.9) yields: 
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which can be computed recursively since hi is known given F;. 


Computational Details 


The algorithms have been written in the R language, version 2.4.1 [see R, De- 
velopment Core Team 2007], with some subroutines implemented in C in order 
to speed up the simulation procedure; this is required in the case of GARCH 
models due to the recursive nature of the conditional process which drastically 
slows down the computations with high level interpreted languages. Moreover, 
the validity of the algorithms as well as the correctness of the computer code 
were verified by a variant of the method proposed by Geweke [2004]. We refer 
the reader to the end of Sect. 3.2.2 for further details. The R packages boa 
0-1.3, mvtnorm 0-7.5, coda 0.10-7, sandwich 2.0-0, Imtest 0.9-18, MASS 
7.2-32 have also been used to perform specific tasks. The R program itself and 
packages are available from CRAN at http://CRAN.R-project.org. 

Regarding the ML estimation of the models, the likelihood functions were 
maximized using the R function nlminb which performs unconstrained and con- 
strained optimization using PORT routines. In some cases, the procedure was 
initialized using the DEoptim function provided by the R package DEoptim 
1.01-2 [see Ardia 2007b]. This function performs a global (robust) optimization 
based on the Differential Evolution algorithm. The reader is referred to Price, 
Storn, and Lampinen [2006] for further details. 

Finally, we note that the R package bayesGARCH will soon be available 
from CRAN [see Ardia 2007a]. This package allows the Bayesian estimation of 
the GARCH(1, 1) model with Student-t innovations. The underlying algorithm is 
based on Nakatsuma [1998, 2000] for the generation of the scedastic function’s 
parameters. The generation of the degrees of freedom parameter is achieved 
following Deschamps [2006], Geweke [1993]. By using a translated Exponential as 
a prior on the degrees of freedom parameter, Normal innovations can be obtained 
as a special case of the sampler. Moreover, the function addPriorConditions 
allows to add any constraints on the model parameters in the M-H sampler. 


Abbreviations and Notations 


Most of the notation as well as dimensions of vectors and matrices are clearly 
defined in the text where it is used. Occasionally, a mathematical symbol has 
been assigned to a different object. 


Abbreviations 

ACF Autocorrelation function 

AEL Absolute error loss 

ARCH AutoRegressive Conditional Heteroscedasticity 
ARDS Adaptive Radial-Based Direction Sampling 
AR AutoRegressive 

ARMA AutoRegressive Moving Average 

BIC Bayesian information criterion 

BF Bayes factor 

BUGS Bayesian analysis Using Gibbs Software 
cont. Continued 

CC Conditional coverage 

CSC Covariance stationarity condition 
DEM/GBP Deutschmark vs British Pounds 

DIC Deviance information criterion 

EM Expectation Maximization 

ES Expected Shortfall 

FFBS Forward Filtering Backward Sampling 
GARCH Generalized ARCH 

GJR Asymmetric GARCH 

IF Inefficiency factor 

IND Independence 

Linex Linex loss 

MCMC Markov Chain Monte Carlo 

M-H Metropolis-Hastings 
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Abbreviations (cont.) 


ML Maximum Likelihood 

MS-GARCH  Markov-switching GARCH 

MS-GJR Markov-switching GJR 

NA Not applicable 

NSE Numerical standard error 

P&L Profit and loss 

SEL Squared error loss 

SMI Swiss Market Index 

S&P100 Standard & Poors 100 

S&P500 Standard & Poors 500 

SSC Strict stationarity condition 

UC Unconditional Coverage 

VaR Value at Risk 

VIX Volatility index of the S&P100 index 

Densities 

N (0,1) Univariate standard Normal density 

Na(, ©) d-dimensional Normal density with mean vector ys and 
covariance matrix 

S(p,07,v) Univariate standard Student-t density with mean ju, scale 
parameter o? and degrees of freedom v 

SS(,07,7,7) Univariate skewed Student-t density with mean p, scale 
parameter o”, degrees of freedom 7 and asymmetry pa- 
rameter 7 

TG(a, b) Inverted Gamma density with shape parameter a and rate 
parameter b 

D(n) Dirichlet density with parameter 7 

x? Chi-squared density with /& degrees of freedom 


Notations used throughout the book 
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“is generated from” 
‘is proportional to” 
‘is preferred over” 
‘is approximately equal to” 
‘Gs defined to” 
‘is equivalent to” 
“belongs to” 
Independent and identically distributed 
Set of real numbers 
Set of non-zero real numbers 
Set of real positive numbers 
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N* 


Set of non-zero natural numbers 


A scalar, a vector, a matrix, a parameter or a set of pa- 


rameters (depending on the context) 
Exponential 

Natural logarithm 

Factorial 

Absolute value 

Summation operator 

Product operator 

Multiplication operator or Cartesian product 
Integral 

Vector without its ith component 
Sequence (or set) of variables (or observations) 
Maximum value of a set 

Minimum value of a set 

Infimum of a set 

Number of elements in a set 
Determinant of a square matrix 
Inverse of a number or a square matrix 
Gamma function 

Digamma function 

Identity matrix of size d 

Information set up to time t 
Integrated of order d 

Indicator function 

Transpose of a vector or a matrix 
Maximum Likelihood estimate 

Sample average 

oth percentile of a sample 

Likelihood function 

Marginal (observed) likelihood function 
Density function 

Distribution function 

Posterior density 

Probability 

Conditional probability 

Expectation 

Conditional expectations 

Potential scale reduction factor 

Vector of zeros 

ARCH parameters 

Vector of ARCH parameters 

GARCH parameter 

Vector of linear regression’s parameters 
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Notations used throughout the book (cont.) 


Degrees of freedom parameter of the Student-t density 
Latent scale variable at time t¢ 

Vector of latent scale variables 

Sets (or vectors) of the model parameters 

a= (a y+ Q2) / 2 

Leverage effect coefficient: Aa = (az — ay) 

Length of the underlying time series (number of observa- 
tions) 

Dependent variable at time t 

Vector containing the observations of y; 

Model innovation at time t 

Model residual at time t 

Linear regression’s error at time t 

Vector of linear regression’s errors 

Number of exogenous or lagged dependent variables 
Vector of exogenous or lagged dependent variables at time 
t 
Matrix whose tth row is x; 

Conditional variance at time t 

Unconditional variance of the underlying process 
Diagonal matrices of conditional variances 

A scaling factor: 9 = 

Number of draws in the posterior sample 

Proposal density 

Previous draw in the sampler 

jth draw in the sampler 

New draw in the sampler 

Hyperparameters of the (truncated) Normal priors 
Parameters of the (truncated) Normal proposals 
Hyperparameters of the translated Exponential prior 
Kurtosis of the innovations 

Unconditional kurtosis of the underlying process 
Recursive transformations 

Vector of recursive transformations 

Matrix whose tth row is c/, 


Indicator variable at time t for the violation of the one- 
day ahead VaR at risk level @ 


Notations specific to Chap. 6 


o pf 
s 


Ut,s 


Risk level of the VaR and ¢° = 1— ¢ 

Time horizon (in days) 

Cumulated return over a s-day ahead horizon at time ¢ 
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Notations specific to Chap. 6 (cont.) 


Vector of s-day ahead log-returns at time t 

pth conditional moment 

¢th percentile of the N’(0,1) density 

éth percentile of the S(0,1,v) density 

Loss function 

True value of the forecast (or one-dimensional parameter) 
Point estimate of w 

Optimal (Bayes) point estimate under loss function 
Posterior risk function under loss 

Parameter of the Linex loss function 

Parameters of the Monomial loss function 

Statistical error: A = (@ — w) 

Number of out-of-sample observations 

VaR. point estimate 

One-day ahead VaR at time t for risk level ¢ 

s-day ahead VaR at time ¢ for risk level @ 

s-day ahead VaR point estimate at time t for risk level ¢ 
and loss function # 

Regulatory capital point estimate at time ¢ for loss func- 
tion & 

Indicator variable at time ¢ for the violation of the s-day 
ahead VaR. at risk level @ 

One-day ahead ES at time ¢ for risk level ¢ 

s-day ahead ES at time ¢ for risk level ¢ 


Notations specific to Chap. 7 


Kronecker product 
Hadamard product, i.e., element-by-element multiplica- 
tion 
Vectorization (column stacking) of a matrix 
Trace of a square matrix 
Number of regimes 
ARCH parameters in state k 
Vectors of ARCH parameters 
Vector containing the vectors of ARCH parameters 
GARCH parameter in state k 
Vector of GARCH parameters 
s,th column of the matrix I¢ 
Vector of ones of size Kv 
Vector of conditional variances at time t 
Parameter’s vector of the ith Dirichlet density 
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Notations specific to Chap. 7 (cont.) 


[e]; ith component of a vector 

T Vector of ergodic probabilities 

te Matrix of transition probabilities 

St Discrete state variable estimated at time t 

s Vector of state variables 

Zt Approximate p-score at time t 

Ut Generalized residual at time ¢ 

D(e) Deviance function 

PD Effective number of parameters 

B Number of replications for the block bootstrap 
q(w) Importance density for parameter w 

p(y) Model likelihood 

ply) tth estimate of the model likelihood in the bridge sampling 
poly) Reciprocal sampling estimator of p(y) 
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